Abstract. We study distributions of persistent homology barcodes associated 
to taking subsamples of a fixed size from metric measure spaces. We show that 
such distributions provide robust invariants of metric measure spaces, and 
illustrate their use in hypothesis testing and providing confidence intervals for 
topological data analysis. 



1. Introduction 

Topological data analysis assigns homological invariants to data presented as a 
finite metric space (a "point cloud"). If we imagine this data as measurements 
sampled from some abstract universal space M, the structure of that space is a 
metric measure space, having a notion both of distance between points and a no- 
tion of probability for the sampling. The usual homological approach to samples is 
to assign a simplicial complex and compute its homology. The construction of the 
associated simplicial complex for a point cloud depends on a choice of scale param- 
eter. The insight of "persistence" is that one should study homological invariants 
that encode change across scales; the correct scale parameter is a priori unknown. 
As such, a first approach to studying the homology of M from the samples is to 
simply compute the persistent homology of the sampled point cloud. 

We can gain some perspective from imagining that we could make measurements 
on M directly and interpret these measurements in terms of random sample points. 
With this in mind, we immediately notice some defects with homology and persis- 
tent homology as invariants of M. While the homology of M captures information 
about the global topology of the metric space, the probability space structure plays 
no role. This has bearing even if we assume M is a compact Riemannian manifold 
and the probability measure is the volume measure for the metric: handles which 
are small represent subsets of low probability but contribute to the homology in the 
same way as large handles. In this particular kind of example, persistent homology 
can identify this type of phenomenon (by encoding the scales at which homological 
features exist); however, in a practical context, the metric on the sample may be 
ad hoc (e.g., [3]) and less closely related to the probability measure. In this case, 
we could have handles that are medium size with respect to the metric but still 
low probability with respect to the measure. Homology and persistent homology 
have no mechanism for distinguishing low probability features from high probability
features. A closely related issue is the effect of small amounts of noise (e.g., a
situation in which a fraction of the samples are corrupted). A small proportion of
bad samples can arbitrarily change the persistent homology. These two kinds of
phenomena are linked, insofar as deciding whether low probability features
are noise is part of data analysis.

The disconnect with the underlying probability measure presents a significant 
problem when trying to adapt persistent homology to the setting of hypothesis 
testing and confidence intervals. Hypothesis testing involves making quantitative 
statements about the probability that the persistent homology computed from a 
sampling from a metric measure space is consistent with (or refutes) a hypothesis 
about the actual persistent homology. Confidence intervals provide a language to 
understand the variability in estimates introduced by the process of sampling. Be- 
cause low probability features and a small proportion of bad samples can have a 
large effect on persistent homology computations, the persistent homology groups 
make poor test statistics for hypothesis testing and confidence intervals. To obtain 
useable test statistics, we need to develop invariants that better reflect the under- 
lying measure and are less sensitive to large perturbation. To be precise about this, 
we use the statistical notion of robustness. 

A statistical estimator is robust when its value cannot be arbitrarily perturbed 
by a constant proportion of bad samples. For instance, the sample mean is not 
robust, as a single extremely large sample value can dominate the result. On the 
other hand, the sample median is robust. As we discuss in Section 4, persistent
homology is not robust. A small number of bad samples can cause large changes 
in the persistent homology, essentially as a reflection of the phenomenon of large 
metric low probability handles (including spurious ones). 
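
The contrast between the mean and the median can be checked numerically. The following is a minimal sketch, not taken from the paper: the helper `corrupt` and the sample values are hypothetical, chosen only to illustrate that a fixed proportion of bad samples moves the mean arbitrarily far while the median stays within the range of the clean data.

```python
import statistics

def corrupt(samples, n_bad, bad_value):
    """Return a copy of `samples` with the first `n_bad` entries replaced by `bad_value`."""
    return [bad_value] * n_bad + list(samples[n_bad:])

# Hypothetical clean data and a corrupted copy (2 of 9 samples replaced).
clean = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0]
dirty = corrupt(clean, n_bad=2, bad_value=1e9)

# The mean follows the outliers; the median moves by at most the diameter of the clean data.
mean_shift = abs(statistics.mean(dirty) - statistics.mean(clean))
median_shift = abs(statistics.median(dirty) - statistics.median(clean))
print(mean_shift, median_shift)
```

Making `bad_value` larger increases `mean_shift` without bound but leaves `median_shift` unchanged, as long as fewer than half of the samples are corrupted.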

Using the idea of an underlying metric measure space $M$, formally the process of
sampling amounts to considering random variables on the probability space
$M^n = M \times \cdots \times M$ equipped with the product probability measure. The $k$-th persistent
homology of a size $n$ sample is a random variable on $M^n$ taking values in the set $\mathcal{B}$
of finite barcodes [23], where a barcode is essentially a multiset of intervals of the
form $[a, b)$. The set $\mathcal{B}$ of barcodes is equipped with a metric, the bottleneck metric
[7], and we show in Section 3 that it is separable and that its completion $\overline{\mathcal{B}}$ is also
a space of barcodes. Then $\overline{\mathcal{B}}$ is Polish, i.e., complete and separable, which makes it
amenable to probability theory (see also [18] for such results). In particular, various
metrics on the set of distributions on $\overline{\mathcal{B}}$ metrize weak convergence, including the
Prohorov metric $d_{Pr}$ and the Wasserstein metric $d_W$. We consider the following
probability distribution on barcode space $\mathcal{B}$:

Definition 1.1. For a metric measure space $(X, \partial_X, \mu_X)$ and fixed $n, k \in \mathbb{N}$, define
the $k$-th $n$-sample persistent homology as

\[ \Phi_k^n(X, \partial_X, \mu_X) = (\mathrm{PH}_k)_*(\mu_X^{\otimes n}), \]

the probability distribution on the set of barcodes $\mathcal{B}$ induced by pushforward along
$\mathrm{PH}_k$ from the product measure $\mu_X^{\otimes n}$ on $X^n$.

In other words, $\Phi_k^n$ is the probability measure on the space of barcodes where
the probability of a subset $A$ is the probability that a size $n$ sample from $M$ has
$k$-th persistent homology landing in $A$.

Although complicated, $\Phi_k^n(M)$ is a continuous invariant of $M$ in the following
sense. The moduli space of metric measure spaces admits a metric (in fact several)
that combines the ideas of the Gromov-Hausdorff distance on compact metric
spaces and weak convergence of probability measures [22]. We follow [11] and use
the Gromov-Prohorov metric, $d_{GPr}$. We prove the following theorem in Section 5
(where it is restated as Theorem 5.2).



Theorem 1.2. Let $(X, \partial_X, \mu_X)$ and $(Y, \partial_Y, \mu_Y)$ be compact metric measure spaces.
Then we have the following inequality relating the Prohorov and Gromov-Prohorov
metrics:

\[ d_{Pr}(\Phi_k^n(X, \partial_X, \mu_X), \Phi_k^n(Y, \partial_Y, \mu_Y)) \leq n \, d_{GPr}((X, \partial_X, \mu_X), (Y, \partial_Y, \mu_Y)). \]

As a consequence of the continuity implied by the previous theorem, we can use
$\Phi_k^n$ to develop robust statistics: If we change $M$ by adjusting the metric arbitrarily
on $\epsilon$ probability mass to produce $M'$, then the Gromov-Prohorov distance satisfies
$d_{GPr}(M, M') \leq \epsilon$.

A first question that arises is how to interpret $\Phi_k^n$ in practice, where we are given
a large finite sample $S$ which we regard as drawn from $M$. Making $S$ a metric
measure space via the subspace metric from $M$ and the empirical measure, we can
compute $\Phi_k^n(S)$ as an approximation to $\Phi_k^n(M)$. This procedure is justified by the
fact that as the sample size increases, the empirical metric measure space converges
almost surely in $d_{GPr}$ to $M$; see Lemma 5.4. (This kind of approximation is intimately
connected with resampling methodology, a topic we study in the paper [1].)
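
The approximation of the barcode distribution by subsampling can be sketched in a few lines for the special case $k = 0$. This is an illustration under stated assumptions rather than the paper's procedure: the function names `h0_barcode` and `empirical_phi0` are hypothetical, only 0-dimensional (reduced) Vietoris-Rips persistence is handled (via the standard fact that its bars die at minimum-spanning-tree edge lengths), and the distribution is approximated by Monte Carlo draws instead of enumerating all size-$n$ samples.

```python
import random

def h0_barcode(points, dist):
    """Finite bars of reduced 0-dimensional Vietoris-Rips persistence.

    Each bar is born at 0 and dies at the scale where two components merge,
    i.e. at a minimum-spanning-tree edge length (Kruskal with union-find).
    Zero-length bars (from repeated points) are dropped."""
    n = len(points)
    parent = list(range(n))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    edges = sorted((dist(points[i], points[j]), i, j)
                   for i in range(n) for j in range(i + 1, n))
    deaths = []
    for length, i, j in edges:
        ri, rj = find(i), find(j)
        if ri != rj:
            parent[ri] = rj
            if length > 0:
                deaths.append(length)
    return tuple((0.0, d) for d in deaths)

def empirical_phi0(sample, n, draws, dist, rng):
    """Monte Carlo stand-in for the n-sample barcode distribution of `sample`
    with its empirical measure: barcodes of `draws` i.i.d. size-n subsamples."""
    return [h0_barcode([rng.choice(sample) for _ in range(n)], dist)
            for _ in range(draws)]

# Hypothetical data: two clusters on the real line.
S = [0.0, 0.1, 0.2, 5.0, 5.1, 5.2]
barcodes = empirical_phi0(S, n=4, draws=50, dist=lambda a, b: abs(a - b),
                          rng=random.Random(0))
```

Sampling with replacement is used because the product of the empirical measure with itself allows repeated points. For higher homological degrees a dedicated persistent homology implementation would be needed in place of `h0_barcode`.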

A problem with $\Phi_k^n$ is that it can be hard to interpret or summarize the information
contained in a distribution of barcodes, unlike distributions of numbers for
which there are various moments (e.g., the mean and the variance) which provide
concise summaries of the distribution. One approach is to develop "topological
summarizations" of distributions of barcodes, a subject we pursue in the paper [2].
In this paper, we instead consider cruder invariants which take values in $\mathbb{R}$. One
such invariant is the distance with respect to a reference distribution on barcodes
$\mathcal{P}$, chosen to represent a hypothesis about the persistent homology of $M$.

Definition 1.3. Let $(X, \partial_X, \mu_X)$ be a compact metric measure space and let $\mathcal{P}$ be
a fixed reference distribution on $\mathcal{B}$. Fix $k, n \in \mathbb{N}$. Define the homological distance
on $X$ relative to $\mathcal{P}$ to be

\[ \mathrm{HD}_k^n((X, \partial_X, \mu_X), \mathcal{P}) = d_{Pr}(\Phi_k^n(X, \partial_X, \mu_X), \mathcal{P}). \]

We also produce a robust statistic $\mathrm{MHD}_k^n$ related to $\mathrm{HD}_k^n$ without first computing
the distribution $\Phi_k^n$. To construct $\mathrm{MHD}_k^n$, we start with a reference barcode and
compute the median distance to the barcodes of subsamples.

Definition 1.4. Let $(X, \partial_X, \mu_X)$ be a compact metric measure space and let $\mathcal{P}$ be
a fixed reference barcode $B \in \mathcal{B}$. Fix $k, n \in \mathbb{N}$. Let $\mathcal{D}$ denote the distribution on $\mathbb{R}$
induced by applying $d_B(B, -)$ to the barcode distribution $\Phi_k^n(X, \partial_X, \mu_X)$. Define
the median homological distance relative to $\mathcal{P}$ to be

\[ \mathrm{MHD}_k^n((X, \partial_X, \mu_X), \mathcal{P}) = \mathrm{median}(\mathcal{D}). \]

The use of the median rather than the mean in the preceding definition ensures
that we compute a robust statistic. To be precise, for functions from finite metric
spaces to metric spaces, we use the following definition of robustness.

Definition 1.5. Let $f$ be a function from finite metric spaces to a metric space
$(B, d)$. We say that $f$ is robust with robustness coefficient $r > 0$ if for any non-empty
finite metric space $(X, \partial)$, there exists a bound $\delta$ such that for any isometry of $X$
into a finite metric space $(X', \partial')$, $|X'|/|X| < 1 + r$ implies $d(f(X, \partial), f(X', \partial')) < \delta$,
where $|X|$ denotes the number of elements of $X$.

For example, under the analogous definition on finite multi-subsets of $\mathbb{R}$ (in
place of finite metric spaces), median defines a function to $\mathbb{R}$ that is robust with
robustness coefficient $1 - \epsilon$ for any $\epsilon$ since expanding a multi-subset $X$ to a larger
one $X'$ with fewer than twice as many elements will not change the median by more
than the diameter of $X$. Similarly, for a finite metric space $X$, expanding $X$ to $X'$,
the proportion of $n$-element samples of $X'$ which are samples of $X$ is $(|X|/|X'|)^n$;
when this number is more than $1/2$, the median value of any function $f$ on the
set of $n$-element samples of $X'$ is then bounded by the values of $f$ on $n$-element
samples of $X$. Since $(N/(N + rN))^n > 1/2$ for $r < 2^{1/n} - 1$, any such function $f$
will be robust with robustness coefficient $r$ satisfying this bound, and in particular
for $r = (\ln 2)/n$.
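
The arithmetic behind the bound can be checked directly. The sketch below uses hypothetical helper names (`good_sample_fraction`, `critical_r`) and simply evaluates $(1/(1+r))^n$ at $r = (\ln 2)/n$ and at the threshold $r = 2^{1/n} - 1$; the range of $n$ tested is an arbitrary choice.

```python
import math

def good_sample_fraction(r, n):
    """Value of (N / (N + r N))^n = (1 / (1 + r))^n: the proportion of
    n-element samples of X' that lie in X when |X'| / |X| = 1 + r."""
    return (1.0 / (1.0 + r)) ** n

def critical_r(n):
    """Threshold 2^(1/n) - 1 at which the proportion equals exactly 1/2."""
    return 2.0 ** (1.0 / n) - 1.0

# For r = ln(2)/n the proportion stays above 1/2, because (1 + x/n)^n < e^x.
checks = [(n, good_sample_fraction(math.log(2) / n, n)) for n in range(1, 201)]
print(checks[0], checks[-1])
```

The values decrease toward $1/2$ as $n$ grows, which reflects that the admissible robustness coefficient shrinks like $1/n$.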

Theorem 1.6. For any $n, k, \mathcal{P}$, the function $\mathrm{MHD}_k^n(-, \mathcal{P})$ from finite metric spaces
(given the uniform probability measure) to $\mathbb{R}$ is robust with robustness coefficient
$> (\ln 2)/n$.

The function $\Phi_k^n$ from finite metric spaces to distributions on $\mathcal{B}$ and the function
$\mathrm{HD}_k^n$ from finite metric spaces to $\mathbb{R}$ are robust for any robustness coefficient for
trivial reasons since the Gromov-Prohorov metric is bounded. However, for these
functions we can give explicit uniform estimates for how much they
change when expanding $X$ to $X'$ just based on $|X'|/|X|$. We introduce the following
notion of uniform robustness, which is strictly stronger than the notion of robustness.

Definition 1.7. Let $f$ be a function from finite metric spaces to a metric space
$(B, d)$. We say that $f$ is uniformly robust with robustness coefficient $r > 0$
and estimate bound $\delta$ if for any non-empty finite metric space $(X, \partial)$ and any
isometry of $(X, \partial)$ into a finite metric space $(X', \partial')$, $|X'|/|X| < 1 + r$ implies
$d(f(X, \partial), f(X', \partial')) < \delta$.

Uniform robustness gives a uniform estimate on the change in the function from
expanding the finite metric space. For example, the median function does not
satisfy the analogous notion of uniform robustness for functions on finite
multi-subsets of $\mathbb{R}$. We show in Section 5 that $\Phi_k^n$ and $\mathrm{HD}_k^n$ satisfy this stronger notion
of uniform robustness.

Theorem 1.8. For fixed $n, k$, $\Phi_k^n$ is uniformly robust with robustness coefficient $r$
and estimate bound $nr/(1 + r)$ for any $r$. For fixed $n, k, \mathcal{P}$, $\mathrm{HD}_k^n(-, \mathcal{P})$ is uniformly
robust with robustness coefficient $r$ and estimate bound $nr/(1 + r)$ for any $r$.

As with $\Phi_k^n$ itself, the law of large numbers and the convergence of empirical
metric measure spaces tell us that given a sufficiently large finite sample $S \subset M$,
we can approximate $\mathrm{HD}_k^n$ and $\mathrm{MHD}_k^n$ of the metric measure space $M$ in a robust
fashion from the persistent homology computations from $S$. (See Lemmas 5.4, 6.5,
and 6.8 below.)

In light of the results on robustness and asymptotic convergence, HD, MHD, and
$\Phi$ (as well as various distributional invariants associated to $\Phi$) provide good test
statistics for hypothesis testing. Furthermore, one of the benefits of the numerical
statistics $\mathrm{HD}_k^n$ and $\mathrm{MHD}_k^n$ is that we can use standard techniques to obtain confidence
intervals, which provide a means for understanding the reliability of analyses
of data sets. We discuss hypothesis testing and the construction of confidence intervals
in Section 6 and explore examples in Sections 7 and 8. In this paper we
primarily focus on analytic methods which use asymptotic normality results; however,
these statistics are well-suited for the construction of resampling confidence
intervals. In the follow-up paper [1] we establish the asymptotic consistency of the
bootstrap for $\mathrm{HD}_k^n$ and $\mathrm{MHD}_k^n$.

We regard this paper as a first step towards providing a foundation for the integration
of standard statistical methodology into computational algebraic topology.
Our goal is to provide tools for practical use in topological data analysis. In a
subsequent paper [1], we provide theoretical support for the use of resampling methods
to understand the distributional invariants $\Phi_k^n$, $\mathrm{HD}_k^n$, and $\mathrm{MHD}_k^n$. We also study
elsewhere topological summarizations in order to understand distributions of
barcodes [2].

The paper is organized as follows. In Section 2, we provide a rapid review
of the necessary background on simplicial complexes, persistent homology, and
metric measure spaces. In Section 3, we study the space of barcodes, establishing
foundations needed to work with distributions of barcodes. In Section 4, we discuss
the robustness of persistent homology. In Section 5, we study the properties of
$\Phi_k^n$, $\mathrm{MHD}_k^n$, and $\mathrm{HD}_k^n$ and prove Theorem 1.2. We discuss hypothesis testing and
confidence intervals in Section 6, which we illustrate with synthetic examples in
Section 7. Section 8 applies these ideas to the analysis of the natural images data
in [3].

2. Background 

2.1. Simplicial complexes associated to point clouds. Computational algebraic
topology proceeds by assigning a simplicial complex (which usually also depends
on a scale parameter $\epsilon$) to a finite metric space $(X, \partial)$. Recall that a simplicial
complex is a combinatorial model of a topological space, defined as a collection $\mathcal{Z}$ of
nonempty finite sets such that for any set $Z \in \mathcal{Z}$, every nonempty subset of
$Z$ is also in $\mathcal{Z}$. Associated to such a simplicial complex is the "geometric realization",
which is formed by gluing standard simplices of dimension $|Z| - 1$ via the
subset relations. (The standard $n$-simplex has $n + 1$ vertices.) The most basic and
widely used construction of a simplicial complex associated to a point cloud is the
Vietoris-Rips complex:

Definition 2.1. For $\epsilon \in \mathbb{R}$, $\epsilon > 0$, the Vietoris-Rips complex $\mathrm{VR}_\epsilon(X)$ is the
simplicial complex with vertex set $X$ such that $[v_0, v_1, \ldots, v_n]$ is an $n$-simplex when
for each pair $v_i, v_j$, the distance $\partial(v_i, v_j) \leq \epsilon$.
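
The definition translates directly into a brute-force enumeration. The following sketch is illustrative only: the function name `vietoris_rips`, the `max_dim` cutoff, and the non-strict comparison `<= eps` are assumptions of this example, and the enumeration over all vertex subsets is exponential, so it is suitable only for very small point clouds.

```python
import math
from itertools import combinations

def vietoris_rips(points, dist, eps, max_dim=2):
    """Simplices (as tuples of vertex indices) of the Vietoris-Rips complex
    on `points` at scale `eps`, up to dimension `max_dim`.

    A set of vertices spans a simplex when all pairwise distances are <= eps
    (convention assumed here)."""
    n = len(points)
    close = [[dist(points[i], points[j]) <= eps for j in range(n)]
             for i in range(n)]
    simplices = []
    for k in range(1, max_dim + 2):  # k vertices span a (k-1)-simplex
        for verts in combinations(range(n), k):
            if all(close[i][j] for i, j in combinations(verts, 2)):
                simplices.append(verts)
    return simplices

# Hypothetical example: corners of the unit square in the plane.
square = [(0.0, 0.0), (1.0, 0.0), (1.0, 1.0), (0.0, 1.0)]
small = vietoris_rips(square, math.dist, eps=1.0)   # 4 vertices + 4 sides
large = vietoris_rips(square, math.dist, eps=1.5)   # diagonals now included
```

At scale 1 the complex is a hollow square (a circle up to homotopy); at scale 1.5 every pair is connected, so all triangles appear as well.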

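Observe that the Vietoris-Rips complex is determined by its 1-skeleton. The
construction is functorial in the sense that for a continuous map $f: X \to Y$ with
Lipschitz constant $\kappa$ and for $\epsilon < \epsilon'$, there is a commutative diagram

(2.2)
\[
\begin{array}{ccc}
\mathrm{VR}_\epsilon(X) & \longrightarrow & \mathrm{VR}_{\kappa\epsilon}(Y) \\
\downarrow & & \downarrow \\
\mathrm{VR}_{\epsilon'}(X) & \longrightarrow & \mathrm{VR}_{\kappa\epsilon'}(Y).
\end{array}
\]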

The Vietoris-Rips complex is easy to compute, but can be unmanageably large;
for a set of points $Y = \{y_1, y_2, \ldots, y_n\}$ such that $\partial(y_i, y_j) \leq \epsilon$, every nonempty subset of $Y$
specifies a simplex of the Vietoris-Rips complex. More closely related to classical
constructions in algebraic topology is the Čech complex.

Definition 2.3. For $\epsilon \in \mathbb{R}$, $\epsilon > 0$, the Čech complex $C_\epsilon(X)$ is the simplicial complex
with vertex set $X$ such that $[v_0, v_1, \ldots, v_n]$ is an $n$-simplex when the intersection

\[ \bigcap_{0 \leq j \leq n} B_{\epsilon/2}(v_j) \]

is non-empty, where here $B_r(x)$ denotes the $r$-ball around $x$.

The Čech complex has analogous functoriality properties to the Vietoris-Rips
complex. In Euclidean space, the topological Čech complex associated to a cover
of a paracompact topological space satisfies the nerve lemma: if the cover consists
of contractible spaces such that all finite intersections are contractible or empty, the
resulting simplicial complex is homotopy equivalent to the original space.

Remark 2.4. It is also often very useful to define complexes with the vertices re- 
stricted to a small set of landmark points; the weak witness complex is perhaps 
the best example of such a simplicial complex [3T]. We discuss this construction 
further in Section [HJ as it is important in the applications. 

The theory we develop in this paper is relatively insensitive to the specific details
of the construction of a simplicial complex associated to a finite metric space (and
scale parameter). For reasons that will become evident when we discuss persistence
in Subsection 2.3 below, the only thing we require is a procedure for assigning a
complex to $((M, \partial), \epsilon)$ that is functorial in the vertical maps of diagram (2.2) for
$\kappa = 1$.

2.2. Homological invariants of point clouds. In light of the previous subsection,
given a finite metric space $(X, \partial)$, one defines the homology at the feature scale
$\epsilon$ to be the homology of a simplicial complex associated to $(X, \partial)$; e.g., $H_*(\mathrm{VR}_\epsilon(X))$
or $H_*(C_\epsilon(X))$. This latter definition is supported by the following essential consistency
result, which is in line with the general philosophy that we are studying an
underlying continuous geometric object via finite sets of samples.

Theorem 2.5 (Niyogi-Smale-Weinberger [19]). Let $(M, \partial)$ be a compact Riemannian
manifold equipped with an embedding $\gamma: M \to \mathbb{R}^n$, and let $X \subset M$ be a finite
sample drawn according to the volume measure on $M$. Then for any $p \in (0, 1)$, there
are constants $\delta$ (which depends on the curvature of $M$ and the embedding $\gamma$) and
$N_{\delta,p}$ such that if $\epsilon < \delta$ and $|X| > N_{\delta,p}$ then the probability that
$H_*(C_\epsilon(X)) \cong H_*(M)$ is $> p$.

In fact, Niyogi, Smale, and Weinberger prove an effective version of the previous
result, in the sense that there are explicit numerical bounds dependent on $p$ and
a "condition number" which incorporates data about the curvature of $M$ and the
twisting of the embedding $\gamma$.

Work by Latschev provides an equivalent result for $\mathrm{VR}_\epsilon(X)$, with somewhat
worse bounds, defined in terms of the injectivity radius of $M$ [16]. Alternatively, one
can obtain consistency results for $\mathrm{VR}_\epsilon(X)$ using the fact that there are inclusions

\[ C_\epsilon(X) \subset \mathrm{VR}_\epsilon(X) \subset C_{2\epsilon}(X). \]

While reassuring, an unsatisfactory aspect of the preceding results is the dependence
on a priori knowledge of the feature scale $\epsilon$ and the details of the intrinsic
curvature of $M$ and the nature of the embedding. A convenient way to handle
the fact that it is often hard to know a good choice of $\epsilon$ at the outset is to consider
multi-scale homological invariants that encode the way homology changes as
$\epsilon$ varies. This leads us to the notion of persistent homology [10].

2.3. Persistent homology. Given a diagram of simplicial complexes indexed on
$\mathbb{N}$ (i.e., a direct system),

\[ X_0 \to X_1 \to \cdots \to X_n \to \cdots, \]

there are natural maps $H_*(X_i) \to H_*(X_j)$ for $i < j$.

We say that a class $\alpha \in H_k(X_i)$ is born at time $i$ if it is not in the image of
$H_k(X_j)$ for $j < i$, and we say a class $\alpha \in H_k(X_i)$ dies at time $i$ if the image of $\alpha$ is
$0$ in $H_k(X_j)$ for $j > i$. This information about the homology can be packaged up
into an algebraic object:

Definition 2.6. Let $\{X_i\}$ be a direct system of simplicial complexes. The $p$th
persistent $k$th homology group of $X_i$ is defined to be

\[ H_{k,p}(X_i) = Z_k^i / (B_k^{i+p} \cap Z_k^i), \]

where $Z$ and $B$ denote the cycle and boundary groups respectively. Alternatively,
$H_{k,p}(X_i)$ is the image of the natural map

\[ H_k(X_i) \to H_k(X_{i+p}). \]

When working over a field and in the presence of suitable finiteness hypotheses,
barcodes provide a convenient reformulation of information from persistent homology.
Specifically, assume that the direct system of simplicial complexes stabilizes
at a finite stage and all homology groups are finitely-generated. Then a basic
classification result of Zomorodian-Carlsson [23] describes the persistent homology in
terms of a barcode, a multiset of non-empty intervals of the form $[a, b) \subset \mathbb{R}$. An
interval in the barcode indicates the birth and death of a specific homological feature.
For reasons we explain below, the barcodes appearing in our context will always
have finite length intervals.

The Rips (or Čech) complexes associated to a point cloud $(X, \partial_X)$ fit into this
context by looking at a sequence of varying values of $\epsilon$:

\[ \mathrm{VR}_{\epsilon_1}(X) \to \mathrm{VR}_{\epsilon_2}(X) \to \cdots. \]

We can do this in several ways, for example, using the fact that the Vietoris-Rips
complex changes only at discrete points $\{\epsilon_i\}$ and stabilizes for sufficiently large $\epsilon$,
or just choosing and fixing a finite sequence $\epsilon_i$ independently of $X$. The theory
we present below makes sense for either of these choices, and we use the following
notation.

Notation 2.7. Let $(X, \partial_X)$ be a finite metric space. For $k \in \mathbb{N}$, denote the persistent
homology of $X$ as $\mathrm{PH}_k((X, \partial_X))$.
More generally, we can make analogous definitions for any functor

\[ \Psi: \mathcal{M} \times \mathbb{R}_{>0} \to \mathrm{sComp}, \]

where $\mathcal{M}$ is the category of finite metric spaces and metric maps and sComp denotes
the category of simplicial complexes. We will call such a $\Psi$ "good" when the
homotopy type changes for only finitely many values in $\mathbb{R}$. In this case, we can choose
the directed system of values of $\epsilon_i$ to contain these transition values.

We note that for large values of the parameter $\epsilon$, $\mathrm{VR}_\epsilon(X)$ will be contractible.
Therefore, if we use the reduced homology group in dimension 0, we get $H_k(\mathrm{VR}_\epsilon) = 0$
for all $k$ for large $\epsilon$. The barcodes associated to these persistent homologies
therefore have only finite length bars. For convenience in computation, we typically
cut off $\epsilon$ at a moderately high value before this breakdown occurs. The result is a
truncation of the barcode to the cut-off point.

2.4. Gromov-Hausdorff stability and the bottleneck metric. By work of
Gromov, the set of isometry classes of finite metric spaces admits a useful metric
structure, the Gromov-Hausdorff metric. For a pair of finite metric spaces $(X_1, \partial_1)$
and $(X_2, \partial_2)$, the Gromov-Hausdorff distance is defined as follows: For a compact
metric space $(Z, \partial)$ and closed subsets $A, B \subset Z$, the Hausdorff distance is defined
to be

\[ d_H^Z(A, B) = \max\Big( \sup_{a \in A} \inf_{b \in B} \partial(a, b), \; \sup_{b \in B} \inf_{a \in A} \partial(a, b) \Big). \]

One then defines the Gromov-Hausdorff distance between $X_1$ and $X_2$ to be

\[ d_{GH}(X_1, X_2) = \inf_{Z, \gamma_1, \gamma_2} d_H^Z(\gamma_1(X_1), \gamma_2(X_2)), \]

where here $\gamma_1: X_1 \to Z$ and $\gamma_2: X_2 \to Z$ are isometric embeddings.
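
For finite subsets of a common metric space the suprema and infima in the Hausdorff distance are maxima and minima, so the formula can be evaluated directly. The sketch below is a hypothetical helper (`hausdorff`), shown for subsets of the real line; it computes only the Hausdorff distance inside a fixed ambient space, which is an upper bound for the Gromov-Hausdorff distance rather than the Gromov-Hausdorff distance itself.

```python
def hausdorff(A, B, dist):
    """Hausdorff distance between finite non-empty subsets A and B of a
    metric space whose metric is given by the function `dist`."""
    # Farthest any point of A is from its nearest point of B.
    a_to_b = max(min(dist(a, b) for b in B) for a in A)
    # Farthest any point of B is from its nearest point of A.
    b_to_a = max(min(dist(a, b) for a in A) for b in B)
    return max(a_to_b, b_to_a)

# Hypothetical example on the real line.
A = [0.0, 1.0]
B = [0.0, 3.0]
d = hausdorff(A, B, lambda x, y: abs(x - y))
```

Here the point 3 of `B` is at distance 2 from its nearest point of `A`, which dominates the other direction.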

Since the topological invariants we are studying ultimately arise from finite metric
spaces, a natural question to consider is the degree to which point clouds that
are close in the Gromov-Hausdorff metric have similar homological invariants. This
question does not in general have a good answer in the setting of the homology of
the point cloud, but in the context of persistent homology, Chazal et al. [6, 3.1]
provide a seminal theorem in this direction that we review as Theorem 2.9 below.

The statement of Theorem 2.9 involves a metric on the set of barcodes called
the bottleneck distance and defined as follows. Recall that a barcode $\{I_\alpha\}$ is a
multiset of non-empty intervals. Given two non-empty intervals $I_1 = [a_1, b_1)$ and
$I_2 = [a_2, b_2)$, define the distance between them to be

\[ d_\infty(I_1, I_2) = \|(a_1, b_1) - (a_2, b_2)\|_\infty = \max(|a_1 - a_2|, |b_1 - b_2|). \]

We also make the convention

\[ d_\infty([a, b), \emptyset) = |b - a|/2 \]

for $b > a$ and $d_\infty(\emptyset, \emptyset) = 0$. For the purposes of the following definition, we define
a matching between two barcodes $B_1 = \{I_\alpha\}$ and $B_2 = \{J_\beta\}$ to be a multi-subset
$C$ of the underlying set of

\[ (B_1 \cup \{\emptyset\}) \times (B_2 \cup \{\emptyset\}) \]

such that $C$ does not contain $(\emptyset, \emptyset)$ and each element $I_\alpha$ of $B_1$ occurs as the first
coordinate of an element of $C$ exactly as many times (counted with multiplicity)
as its multiplicity in $B_1$, and likewise for every element of $B_2$. We get a more
intuitive but less convenient description of a matching using the decomposition of
$(B_1 \cup \{\emptyset\}) \times (B_2 \cup \{\emptyset\})$ into its evident four pieces: The basic data of $C$ consists of
multi-subsets $A_1 \subset B_1$ and $A_2 \subset B_2$ together with a bijection (properly accounting
for multiplicities) $\gamma: A_1 \to A_2$; $C$ is then the (disjoint) union of the graph of $\gamma$
viewed as a multi-subset of $B_1 \times B_2$, the multi-subset $(B_1 - A_1) \times \{\emptyset\}$ of $B_1 \times \{\emptyset\}$,
and the multi-subset $\{\emptyset\} \times (B_2 - A_2)$ of $\{\emptyset\} \times B_2$. With this terminology, we can
define the bottleneck distance.

Definition 2.8. The bottleneck distance between barcodes $B_1 = \{I_\alpha\}$ and
$B_2 = \{J_\beta\}$ is

\[ d_B(B_1, B_2) = \inf_C \sup_{(I,J) \in C} d_\infty(I, J), \]

where $C$ varies over all matchings between $B_1$ and $B_2$.
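
For very small finite barcodes the infimum over matchings can be computed by exhaustive search. The sketch below is a brute-force illustration under stated assumptions, not an efficient algorithm: the names `d_inf` and `bottleneck` are hypothetical, intervals are represented as `(a, b)` pairs, and `None` stands for the empty interval. Each barcode is padded with as many empty intervals as the other barcode has bars, so that matchings become permutations; pairs of two empty intervals cost 0 and do not affect the result.

```python
from itertools import permutations

def d_inf(I, J):
    """Distance between intervals given as (a, b) pairs; None is the empty interval."""
    if I is None and J is None:
        return 0.0
    if I is None:
        return abs(J[1] - J[0]) / 2
    if J is None:
        return abs(I[1] - I[0]) / 2
    return max(abs(I[0] - J[0]), abs(I[1] - J[1]))

def bottleneck(B1, B2):
    """Bottleneck distance between two finite barcodes by exhaustive search
    over matchings (factorial cost; only for tiny examples)."""
    left = list(B1) + [None] * len(B2)
    right = list(B2) + [None] * len(B1)
    if not left:
        return 0.0
    return min(max(d_inf(I, J) for I, J in zip(left, perm))
               for perm in permutations(right))

# Hypothetical examples.
print(bottleneck([(0, 4)], [(0, 5)]))    # bars are matched to each other
print(bottleneck([(0, 10)], [(0, 1)]))   # cheaper to leave both bars unmatched
```

In the second example matching the two bars costs 9, while sending each to the empty interval costs at most 5, so the optimal matching leaves both unmatched.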

Although expressed slightly differently, this agrees with the bottleneck metric as 
defined in §3.1] and [SJ §2.2]. On the set of barcodes B with finitely many finite 
length intervals, $d_B$ is obviously a metric. More generally, for any $p > 0$, one can
consider the $\ell^p$ version of this metric,

\[ d_{B,p}(B_1, B_2) = \inf_C \Big( \sum_{(I,J) \in C} d_\infty(I, J)^p \Big)^{1/p}. \]

For simplicity, we focus on $d_B$ in this paper, but analogues of our main theorems
apply to these variant metrics as well.

We have the following essential stability theorem: 

Theorem 2.9 (Chazal et al. [6, 3.1]). For each $k$, we have the bound

\[ d_B(\mathrm{PH}_k(X), \mathrm{PH}_k(Y)) \leq d_{GH}(X, Y). \]

Note that truncating barcodes is a Lipschitz map $\mathcal{B} \to \mathcal{B}$ with Lipschitz constant
1, so the bound above still holds when we use a large parameter cut-off in defining
$\mathrm{PH}_k$.

2.5. Metric measure spaces and the Gromov-Prohorov distance. To es- 
tablish more robust convergence results, we work with suitable metrics on the set 
of compact metric measure spaces. Specifically, following [TTJ [TT], H2] we use the 
idea of the Gromov-Hausdorff metric to extend certain standard metrics on distri- 
butions (on a fixed metric measure space) to a metric on the set of all compact 
metric measure spaces. 

A basic metric of this kind is the Gromov-Prohorov metric [11]. (For the following
formulas, see Section 5 of [11] and its references.) This is defined in terms
of the standard Prohorov metric $d_{Pr}$ (metrizing weak convergence of probability
distributions) as

\[ d_{GPr}((X, \partial_X, \mu_X), (Y, \partial_Y, \mu_Y)) = \inf_{(\phi_X, \phi_Y, Z)} d_{Pr}^{(Z, \partial_Z)}((\phi_X)_* \mu_X, (\phi_Y)_* \mu_Y), \]

where the inf is computed over all isometric embeddings $\phi_X: X \to Z$ and $\phi_Y: Y \to Z$
into a target metric space $(Z, \partial_Z)$.

It is very convenient to reformulate both the Gromov-Hausdorff and Gromov-Prohorov
distances in terms of relations. For sets $X$ and $Y$, a relation $R \subset X \times Y$ is a
correspondence if for each $x \in X$ there exists at least one $y \in Y$ such that $(x, y) \in R$
and for each $y' \in Y$ there exists at least one $x' \in X$ such that $(x', y') \in R$. For a
relation $R$ on metric spaces $(X, \partial_X)$ and $(Y, \partial_Y)$, we define the distortion as

\[ \mathrm{dis}(R) = \sup_{(x,y), (x',y') \in R} |\partial_X(x, x') - \partial_Y(y, y')|. \]

The Gromov-Hausdorff distance can be expressed as

\[ d_{GH}((X, \partial_X), (Y, \partial_Y)) = \frac{1}{2} \inf_R \mathrm{dis}(R), \]

where we are taking the infimum over all correspondences $R \subset X \times Y$.
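
For tiny finite metric spaces the correspondence formulation can be evaluated by enumerating every relation. The following is a sketch under stated assumptions: the function names are hypothetical, points must be distinct and hashable, the standard normalization by one half is used, and the search over all subsets of $X \times Y$ is exponential, so it is meant only to make the definitions concrete.

```python
from itertools import product

def distortion(R, dX, dY):
    """Distortion of a relation R, given as a list of pairs (x, y)."""
    return max(abs(dX(x, xp) - dY(y, yp)) for (x, y) in R for (xp, yp) in R)

def is_correspondence(R, X, Y):
    """True when every point of X and every point of Y occurs in some pair of R."""
    return {x for x, _ in R} == set(X) and {y for _, y in R} == set(Y)

def gromov_hausdorff_bruteforce(X, Y, dX, dY):
    """Half the minimal distortion over all correspondences between X and Y."""
    pairs = list(product(X, Y))
    best = float("inf")
    for mask in range(1, 2 ** len(pairs)):
        R = [p for i, p in enumerate(pairs) if (mask >> i) & 1]
        if is_correspondence(R, X, Y):
            best = min(best, distortion(R, dX, dY))
    return best / 2

# Hypothetical example: two-point spaces with distances 1 and 3.
line = lambda a, b: abs(a - b)
gh = gromov_hausdorff_bruteforce([0.0, 1.0], [0.0, 3.0], line, line)
```

For two-point spaces the optimal correspondence is a bijection, and the result is half the difference of the two distances.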

Similarly, we can reformulate the Prohorov metric as follows. Given two measures
$\mu_1$ and $\mu_2$ on a metric space $X$, let a coupling of $\mu_1$ and $\mu_2$ be a measure $\psi$ on
$X \times X$ (with the product metric) such that $\psi(X \times -) = \mu_2$ and $\psi(- \times X) = \mu_1$.
Then we have

\[ d_{Pr}(\mu_1, \mu_2) = \inf_\psi \inf \{ \epsilon > 0 \mid \psi\{(x, x') \in X \times X \mid \partial_X(x, x') \geq \epsilon\} \leq \epsilon \}. \]

This characterization of the Prohorov metric turns out to be useful when working
with the Gromov-Prohorov metric in light of the (trivial) observation that if
$d_{GPr}((X, \partial_X, \mu_X), (Y, \partial_Y, \mu_Y)) < \epsilon$ then there exists a metric space $Z$ and embeddings
$\iota_1: X \to Z$ and $\iota_2: Y \to Z$ such that $d_{Pr}((\iota_1)_* \mu_X, (\iota_2)_* \mu_Y) < \epsilon$.

3. Probability measures on the space of barcodes 

This section introduces the spaces of barcodes $\mathcal{B}_N$ and $\overline{\mathcal{B}}$ used in the distributional
invariants of Definition 1.1. These spaces are complete and separable
under the bottleneck metric. This implies in particular that the Prohorov metric
on the set of probability measures in $\mathcal{B}_N$ or $\overline{\mathcal{B}}$ metrizes weak convergence,
which justifies the perspective in the stability theorem (Theorem 1.2) and the definition of the
invariants $\mathrm{HD}_k^n(-, \mathcal{P})$ in Definition 1.3.



A barcode is by definition a multi-set of intervals, in our case of the form $[a, b)$
for $0 \leq a < b < \infty$. The set $\mathcal{I}$ of all intervals of this form is of course in bijective
correspondence with a subset of $\mathbb{R}^2$. A multi-set $A$ of intervals is a multi-subset of
$\mathcal{I}$, which concretely is a function from $\mathcal{I}$ to the natural numbers $\mathbb{N} = \{0, 1, 2, 3, \ldots\}$
which counts the number of copies of each interval in $A$. We denote by $|A|$ the
cardinality of $A$, which we define as the sum of the values of the function $\mathcal{I} \to \mathbb{N}$
specified by $A$ (if finite, or countably or uncountably infinite, if not). The space
$\mathcal{B}$ of barcodes of the introduction is the set of multi-sets of intervals $A$ such that
$|A| < \infty$. We have the following important subsets of $\mathcal{B}$.

Definition 3.1. For $N \geq 0$, let $\mathcal{B}_N$ denote the set of multi-sets of intervals (in $\mathcal{I}$)
$A$ with $|A| \leq N$.

The main result on $\mathcal{B}_N$ is the following theorem, proved below. (Similar results
can also be found in [18].)

Theorem 3.2. For each $N \geq 0$, $\mathcal{B}_N$ is complete and separable under the bottleneck
metric.

Since the $k$-th homology (with any coefficients) of any complex with $n$ vertices
can have rank at most $\binom{n}{k+1}$, our persistent homology barcodes will always land in
one of the $\mathcal{B}_N$, with $N$ depending just on the size of the samples. As we let the size
of the samples increase, $N$ may increase, and so it is convenient to have a target
independent of the number of samples. The space $\mathcal{B} = \bigcup_N \mathcal{B}_N$ is clearly not complete
under the bottleneck metric, so we introduce the following space of barcodes $\overline{\mathcal{B}}$.

Definition 3.3. Let $\overline{\mathcal{B}}$ be the space of multi-sets $A$ of intervals (in $\mathcal{I}$) with the
property that for every $\epsilon > 0$ the multi-subset of $A$ of those
intervals of length more than $\epsilon$ has finite cardinality.

Clearly barcodes in $\overline{\mathcal{B}}$ have at most countable cardinality, and the bottleneck
metric extends to a function $d_B: \overline{\mathcal{B}} \times \overline{\mathcal{B}} \to \mathbb{R}$. A straightforward greedy argument
shows that for $X, Y \in \overline{\mathcal{B}}$, $d_B(X, Y) = 0$ only if $X = Y$, and so $d_B$ extends to a
metric on $\overline{\mathcal{B}}$. We prove the following theorem.

Theorem 3.4. $\overline{\mathcal{B}}$ is the completion of $\mathcal{B} = \bigcup_N \mathcal{B}_N$ in the bottleneck metric. In
particular $\overline{\mathcal{B}}$ is complete and separable in the bottleneck metric.



Proof of Theorems 3.2 and 3.4. The multi-sets of intervals with rational endpoints
provide a countable dense subset for $\mathcal{B}_N$. To see that $\mathcal{B}$ is dense in $\overline{\mathcal{B}}$, given $A$ in
$\overline{\mathcal{B}}$ and $\epsilon > 0$, let $A_\epsilon$ be the multi-subset of $A$ of those intervals of length $> \epsilon$. Then
by definition of $\overline{\mathcal{B}}$, $A_\epsilon$ is in $\mathcal{B}$, and by definition of the bottleneck metric, using the
matching coming from the inclusion of $A_\epsilon$ in $A$, we have that

\[ d_B(A, A_\epsilon) \leq \epsilon/2 < \epsilon. \]

It just remains to prove completeness of $\mathcal{B}_N$ and $\overline{\mathcal{B}}$. For this, given a Cauchy
sequence $(X_n)$ in $\overline{\mathcal{B}}$ it suffices to show that $X_n$ converges to an element $X$ in $\overline{\mathcal{B}}$ and
that $X$ is in $\mathcal{B}_N$ if all the $X_n$ are in $\mathcal{B}_N$.

Let $(X_n)$ be a Cauchy sequence in $\overline{\mathcal{B}}$. By passing to a subsequence if necessary,
we can assume without loss of generality that for $n, m > k$, $d_{\mathcal{B}}(X_m, X_n) < 2^{-(k+2)}$.
For each $n$, we have $d_{\mathcal{B}}(X_n, X_{n+1}) < 2^{-(n+1)}$; choose a matching $C_n$ such that
$d_\infty(I, J) < 2^{-(n+1)}$ for all $(I, J) \in C_n$. For each $n$, define a finite sequence of
intervals $I^n_1, \ldots, I^n_{k_n}$ inductively as follows. Let $k_0 = 0$. Let $k_1$ be the cardinality of
the multi-subset of $X_1$ consisting of those intervals of length $> 1$, and let $I^1_1, \ldots, I^1_{k_1}$
be an enumeration of those intervals. By induction, $I^n_1, \ldots, I^n_{k_n}$ is an enumeration
of the intervals in $X_n$ of length $> 2^{-n+1}$ such that for $i \le k_{n-1}$ the intervals $I^{n-1}_i$
and $I^n_i$ correspond under the matching $C_{n-1}$. For the inductive step, we note that
if $I^n_i$ corresponds to $J$ under $C_n$, then $d_\infty(I^n_i, J) < 2^{-(n+1)}$, so the length $\|J\|$ of
$J$ is bigger than $\|I^n_i\| - 2^{-n}$, and

$$\|J\| > 2^{-n+1} - 2^{-n} = 2^{-n} = 2^{-(n+1)+1}.$$

Thus, we can choose $I^{n+1}_i$ to be the corresponding interval $J$ for $i \le k_n$, and we can
choose the remaining intervals of length $> 2^{-(n+1)+1}$ in an arbitrary order. Write
$I^n_i = [a^n_i, b^n_i)$ and let

$$a_i = \lim_{n \to \infty} a^n_i, \qquad b_i = \lim_{n \to \infty} b^n_i.$$

Since $|a^n_i - a^{n+1}_i| < 2^{-(n+1)}$ and $|b^n_i - b^{n+1}_i| < 2^{-(n+1)}$, we have

$$|a^n_i - a_i| < 2^{-n}, \qquad |b^n_i - b_i| < 2^{-n}.$$

Let $X$ be the multi-subset of $\mathcal{I}$ consisting of the intervals $I_i = [a_i, b_i)$ for all $i$ (or
for all $i \le \max k_n$ if $\{k_n\}$ is bounded).

First, we claim that $X$ is in $\overline{\mathcal{B}}$. Given $\epsilon > 0$, choose $N$ large enough that $2^{-N+2} < \epsilon$.
Then for $i > k_N$, the interval $I_i$ first appears in $X_{n_i}$ for some $n_i > N$. Looking at
the matchings $C_N, \ldots, C_{n_i - 1}$, we get a composite matching $C_{N, n_i}$ between $X_N$ and
$X_{n_i}$. Since each $C_n$ satisfied the bound $2^{-(n+1)}$, the matching $C_{N, n_i}$ must satisfy
the bound

$$\sum_{n=N}^{n_i - 1} 2^{-(n+1)} = 2^{-N} - 2^{-n_i}.$$

Since all intervals of length $> 2^{-N+1}$ in $X_N$ appear as an $I^N_j$, we must have that
the length of $I^{n_i}_i$ in $X_{n_i}$ must be less than

$$2^{-N+1} + 2\,(2^{-N} - 2^{-n_i}) = 2^{-N+2} - 2^{-n_i+1}.$$

Since each endpoint in $I_i$ differs from the endpoint of $I^{n_i}_i$ by at most $2^{-n_i}$, the
length of $I_i$ can be at most

$$2^{-N+2} - 2^{-n_i+1} + 2 \cdot 2^{-n_i} = 2^{-N+2} < \epsilon.$$

Thus, the cardinality of the multi-subset of $X$ of those intervals of length $> \epsilon$ is at
most $k_N$.

Next we claim that $(X_n)$ converges to $X$. We have a matching of $X_n$ with $X$ given
by matching the intervals $I^n_1, \ldots, I^n_{k_n}$ in $X_n$ with the corresponding intervals
$I_1, \ldots, I_{k_n}$ in $X$. Our estimates above for $|a^n_i - a_i|$ and $|b^n_i - b_i|$ show that
$d_\infty(I^n_i, I_i) < 2^{-n}$. By construction, each leftover interval in $X_n$ has length $\le 2^{-n+1}$, and the
previous paragraph shows that each leftover interval in $X$ has length $< 2^{-n+2}$.
Thus, $d_{\mathcal{B}}(X_n, X) \le 2^{-n+1}$.

Finally we note that if each $X_n$ is in $\mathcal{B}_N$ for fixed $N$, then each $k_n \le N$ and so
$X$ is in $\mathcal{B}_N$. □

4. Failure of robustness 

Inevitably physical measurements will result in bad samples. As a consequence, 
we are interested in invariants which have limited sensitivity to a small proportion 
of arbitrarily bad samples. Many standard invariants not only have high sensitivity 
to a small proportion of bad samples, but in fact have high sensitivity to a small 
number of bad samples. We use the following terminology. 

Definition 4.1. A function $f$ from the set of finite metric spaces to $\mathbb{R}$ is fragile
if there exists a constant $k$ such that for every non-empty finite metric space $X$
and constant $N$ there exists a metric space $X'$ and an isometry $X \to X'$ such that
$|X'| \le |X| + k$ and $|f(X') - f(X)| > N$.



Informally, fragile in Definition 4.1 means that adding a small constant number
of points to any metric space can arbitrarily change the value of the invariant. In
particular, a fragile function is not robust in the sense of Definition 1.4 for any
robustness coefficient $r > 0$, but fragile is much more unstable than just failing to
be robust (note the quantifier on the space $X$). As we indicated in the introduction,
Gromov-Hausdorff distance is not robust; here we show it is fragile.

Proposition 4.2. Let $(Z, d_Z)$ be a non-empty finite metric space. The function
$d_{GH}(Z, -)$ is fragile.

Proof. Given $N > 0$, consider the space $X'$ which is defined as a set to be the
disjoint union of $X$ with a new point $w$, and made a metric space by setting

$$d(w, x) = \alpha, \quad x \in X,$$
$$d(x_1, x_2) = d_X(x_1, x_2), \quad x_1, x_2 \in X,$$

where $\alpha > \mathrm{diam}(Z) + 2 d_{GH}(Z, X) + 2N$. We claim

$$|d_{GH}(Z, X) - d_{GH}(Z, X')| > N.$$

Given any metric space $(Y, d_Y)$ and isometries $f \colon X' \to Y$, $g \colon Z \to Y$, we need
to show that $d_Y(f(X'), g(Z)) > N + d_{GH}(Z, X)$. We have two cases. First, if no
point $z$ of $Z$ has $d_Y(g(z), f(w)) \le N + d_{GH}(Z, X)$, then we have $d_Y(f(X'), g(Z)) >
N + d_{GH}(Z, X)$. On the other hand, if some point $z$ of $Z$ has $d_Y(g(z), f(w)) \le
N + d_{GH}(Z, X)$, then every point $z$ in $Z$ satisfies $d_Y(g(z), f(w)) \le N + \mathrm{diam}(Z) +
d_{GH}(Z, X)$. Choosing some $x$ in $X$, we see that for every $z$ in $Z$, $d_Y(f(x), g(z)) \ge
\alpha - (N + \mathrm{diam}(Z) + d_{GH}(Z, X))$. It follows that

$$d_Y(f(X'), g(Z)) \ge \alpha - (N + \mathrm{diam}(Z) + d_{GH}(Z, X)) > N + d_{GH}(Z, X). \qquad \square$$

The homology and persistent homology of a point cloud turn out to be somewhat
less sensitive invariants. Nonetheless, a similar kind of problem can occur. It
is instructive to consider the case of $H_0$ or $\mathrm{PH}_0$. By adding $\ell$ points far from the
original metric space $X$, one can change either $H_0$ or $\mathrm{PH}_0$ by rank $\ell$. The further
the distance of the points, the longer the additional bars in the barcode, and we
see for example that the distance $d_{\mathcal{B}}(B, -)$ in the bottleneck metric from any fixed
barcode $B$ is a fragile function. (If we are truncating the barcodes, $d_{\mathcal{B}}$ is bounded
by the length of the interval we are considering, so technically it is robust, but not in
a meaningful way.) We can also consider the rank of $H_0$ or of $\mathrm{PH}_0$ in a range; here
the distortion of the function depends on the number of points, but we see that the
function is not robust.
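
The effect on $\mathrm{PH}_0$ can be checked numerically. The following is a minimal sketch, assuming the convention that an edge enters the Rips filtration at a scale equal to its length, so that the finite $H_0$ bars are $[0, w)$ for the edge weights $w$ of a minimum spanning tree. The function name `h0_death_times` and the sample points are hypothetical and chosen only for illustration.

```python
import math


def h0_death_times(points):
    """Death times of the finite H_0 bars of a Euclidean point cloud.

    Assumption: an edge appears at scale equal to its length, so the finite
    bars are [0, w) for each edge weight w of a minimum spanning tree.
    The tree is built with Prim's algorithm.
    """
    n = len(points)
    if n < 2:
        return []
    in_tree = [False] * n
    best = [math.inf] * n  # cheapest known connection of each vertex to the tree
    best[0] = 0.0
    deaths = []
    for step in range(n):
        # Pick the vertex outside the tree that is closest to it.
        u = min((i for i in range(n) if not in_tree[i]), key=lambda i: best[i])
        in_tree[u] = True
        if step > 0:
            deaths.append(best[u])  # one component merges at this scale
        for v in range(n):
            if not in_tree[v]:
                best[v] = min(best[v], math.dist(points[u], points[v]))
    return sorted(deaths)


cloud = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0)]
far_away = cloud + [(100.0, 0.0)]  # a single "bad" sample
print(h0_death_times(cloud))
print(h0_death_times(far_away))
```

Moving the extra point further away lengthens the new bar without bound, which is the mechanism behind the fragility of the bottleneck distance to a fixed barcode.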

For $H_k$ and $\mathrm{PH}_k$, $k > 0$, the same basic idea obtains: we add small spheres
sufficiently far from the core of the points in order to adjust the required homology.
We work this out explicitly for $\mathrm{PH}_1$.

Definition 4.3. For each integer $k > 0$ and real $\ell > 0$, let the metric circle $S^1_{k,\ell}$
denote the metric space with $k$ points $\{x_i\}$ such that

$$d(x_i, x_j) = \ell \cdot \min(|i - j|,\; k - |i - j|).$$

For $\epsilon < \ell$, the Rips complex associated to $S^1_{k,\ell}$ is just a collection of disconnected
points. It is clear that as long as $k > 4$, when $\ell \le \epsilon < 2\ell$, $|R_\epsilon(S^1_{k,\ell})|$ has the
homotopy type of a circle. In fact, we can say something more precise:

Lemma 4.4. For

$$\ell \le \epsilon < \left\lceil \frac{k}{3} \right\rceil \ell$$

the rank of $H_1(R_\epsilon(S^1_{k,\ell}))$ is at least 1.

Proof. Consider the map $f$ from $R_\epsilon(S^1_{k,\ell})$ to the unit disk $D^2$ in $\mathbb{R}^2$ that sends $x_i$ to
$(\cos(2\pi i/k), \sin(2\pi i/k))$ and is linear on each simplex. The condition $\epsilon < \lceil k/3 \rceil \ell$ precisely
ensures that whenever $\{x_{i_1}, \ldots, x_{i_n}\}$ forms a simplex $\sigma$ in the Rips complex, the
image vertices $f(x_{i_1}), \ldots, f(x_{i_n})$ lie on an arc of angle $< \frac{2}{3}\pi$ on the unit circle,
and so $f(\sigma)$ in particular lies in an open half plane through the origin. It follows
that the origin $(0, 0)$ is not in the image of any simplex, and $f$ defines a map from
$R_\epsilon(S^1_{k,\ell})$ to the punctured disk $D^2 - \{(0, 0)\}$. Since $\ell \le \epsilon$, we have the 1-cycle

$$[x_1, x_2] + \cdots + [x_{k-1}, x_k] + [x_k, x_1]$$

of $R_\epsilon(S^1_{k,\ell})$ which maps to a 1-cycle in $D^2 - \{(0, 0)\}$ representing the generator of
$H_1(D^2 - \{(0, 0)\})$. □
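
The metric circle and the edge structure used in the argument above can be sketched in a few lines. This is an illustration under two assumptions that are labeled as such: the cyclic distance is read as $\ell \cdot \min(|i-j|, k-|i-j|)$, and an edge belongs to the Rips complex at scale $\epsilon$ when its length is at most $\epsilon$. The function names are hypothetical.

```python
def metric_circle(k, ell):
    """Distance matrix of k points spaced ell apart around a cycle.

    Assumption: d(x_i, x_j) = ell * min(|i - j|, k - |i - j|).
    """
    return [[ell * min(abs(i - j), k - abs(i - j)) for j in range(k)]
            for i in range(k)]


def rips_edges(dist, eps):
    """Edges (i, j), i < j, of the Rips complex at scale eps.

    Assumption: an edge is present when d(i, j) <= eps.
    """
    k = len(dist)
    return [(i, j) for i in range(k) for j in range(i + 1, k)
            if dist[i][j] <= eps]


d = metric_circle(8, 1.0)
print(rips_edges(d, 0.5))       # below ell: no edges, only isolated points
print(len(rips_edges(d, 1.5)))  # between ell and 2*ell: the edges of one cycle
```

For $\ell \le \epsilon < 2\ell$ every vertex has exactly two neighbours, so the 1-skeleton is a single $k$-cycle and no triangles are present, matching the claim that the complex has the homotopy type of a circle in that range.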

The length $\ell$ and number $k > 4$ are arbitrary, so again, we conclude that functions
like $d_{\mathcal{B}}(B, \mathrm{PH}_1(-))$ are fragile. Results for higher dimensions (using similar
standard equidistributed models of $n$-spheres) are completely analogous.

Proposition 4.5. Let $B$ be a barcode. The functions $d_{\mathcal{B}}(B, \mathrm{PH}_k(-))$ from finite
metric spaces to $\mathbb{R}$ are fragile.

In terms of rank, the lemma shows that we can increase the rank of the first persistent
homology group of a metric space $X$ on an interval $[a, b]$ by $m$ simply by adding
extra points. One can also typically reduce persistent homology intervals by adding
points "in the center" of the representing cycle. It is somewhat more complicated
to precisely analyze the situation, so we give a representative example: Suppose the
cycle is represented by a collection of points $\{x_i\}$ such that the maximum distance
$d(x_i, x_j) \le \delta$. Then adding a point which is a distance $\frac{1}{2}\delta$ from each of the other
points reduces the lifetime of that cycle to $\delta$. In any case, the results of the lemma
are sufficient to prove the following proposition.

Proposition 4.6. The function that takes a finite metric space to the rank of
$\mathrm{PH}_k$ on a fixed interval $[a, b]$ is not robust (in the sense of Definition 1.4) for any
robustness coefficient $r$.

These computations suggest a problem with the stability of the usual invariants 
of computational topology. A small number of bad samples can lead to arbitrary 
changes in these invariants. 

5. The main definition and theorem 

Fix a good functorial assignment of a simplicial complex to a finite metric space
and a scale parameter $\epsilon$. Recall that we write $\mathrm{PH}_k$ of a finite metric space to denote
the persistent homology of the associated direct system of complexes. Motivated
by the concerns of the preceding section, we make the following definition.

Definition 5.1. For a metric measure space $(X, \partial_X, \mu_X)$ and fixed $n, k \in \mathbb{N}$, define
the $k$th $n$-sample persistent homology as

$$\Phi^n_k(X, \partial_X, \mu_X) = (\mathrm{PH}_k)_*(\mu_X^{\otimes n}),$$

the probability distribution on $\mathcal{B}$ induced by pushforward along $\mathrm{PH}_k$ from the
product measure $\mu_X^{\otimes n}$ on $X^n$.
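
For a finite metric space with the uniform measure, the pushforward in Definition 5.1 can be approximated by Monte Carlo sampling. The sketch below is an illustration only: `empirical_phi` and its parameters are hypothetical names, and `barcode_fn` stands in for whatever persistent homology routine is available (none is assumed here). Sampling is done with replacement, since that is what the product measure of the uniform measure on a finite set amounts to.

```python
import random


def empirical_phi(points, n, num_samples, barcode_fn, seed=0):
    """Monte Carlo approximation to the n-sample barcode distribution.

    points      : finite metric space, given as a list of points
    n           : number of points in each subsample
    num_samples : number of subsamples to draw
    barcode_fn  : hypothetical callable mapping a list of points to a barcode
    Each subsample is drawn i.i.d. from the uniform measure (with replacement).
    Returns the list of barcodes, i.e. an empirical distribution.
    """
    rng = random.Random(seed)
    return [barcode_fn([rng.choice(points) for _ in range(n)])
            for _ in range(num_samples)]


# Usage with a placeholder "barcode": the number of distinct points sampled.
pts = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]
draws = empirical_phi(pts, n=3, num_samples=5, barcode_fn=lambda s: len(set(s)))
print(draws)
```

Passing the barcode routine as a parameter keeps the sampling step independent of the choice of complex, which matters later when complexes other than the Rips complex are considered.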

The goal of this section is to prove the main theorem. For this (and in the 
remainder of the section), we assume that we are computing PH using the Rips 
complex. 

Theorem 5.2. Let $(X, \partial_X, \mu_X)$ and $(Y, \partial_Y, \mu_Y)$ be compact metric measure spaces.
Then we have the following inequality:

$$d_{Pr}\bigl(\Phi^n_k(X, \partial_X, \mu_X), \Phi^n_k(Y, \partial_Y, \mu_Y)\bigr) \le n\, d_{GPr}\bigl((X, \partial_X, \mu_X), (Y, \partial_Y, \mu_Y)\bigr).$$

Proof. Assume that $d_{GPr}((X, \partial_X, \mu_X), (Y, \partial_Y, \mu_Y)) < \epsilon$. Then we know that there
exist embeddings $\iota_1 \colon X \to Z$ and $\iota_2 \colon Y \to Z$ into a metric space $Z$ and a coupling
$\hat{\mu}$ between $(\iota_1)_*\mu_X$ and $(\iota_2)_*\mu_Y$ such that the probability mass of the set of pairs
$(z, z')$ under $\hat{\mu}$ such that $\partial_Z(z, z') > \epsilon$ is less than $\epsilon$.

We can regard the restriction of $\hat{\mu}^{\otimes n}$ to the full measure subspace $(X \times Y)^n$
of $(Z \times Z)^n$ as a probability measure on $X^n \times Y^n$. This then induces a coupling
between $(\mathrm{PH}_k)_*(\mu_X^{\otimes n})$ and $(\mathrm{PH}_k)_*(\mu_Y^{\otimes n})$ on $\mathcal{B}$, which we now study. Consider $n$ samples
$\{(x_1, y_1), (x_2, y_2), \ldots, (x_n, y_n)\}$ from $Z \times Z$ drawn according to the product
distribution $\hat{\mu}^{\otimes n}$. Now consider the probability that

$$\alpha = \sup_{1 \le i, j \le n} |\partial_X(x_i, x_j) - \partial_Y(y_i, y_j)| > 2\epsilon.$$
The triangle inequality implies that

$$|\partial_X(x_i, x_j) - \partial_Y(y_i, y_j)| = |\partial_Z(x_i, x_j) - \partial_Z(y_i, y_j)| \le \partial_Z(x_i, y_i) + \partial_Z(x_j, y_j).$$

Therefore, the union bound implies that the probability that $\alpha > 2\epsilon$ is bounded by

$$\sum_{i=1}^{n} \Pr(\partial_Z(x_i, y_i) > \epsilon) < n\epsilon.$$

Next, define a relation $R$ that matches $x_i$ and $y_i$. By definition, the distortion of
this relation is $\operatorname{dis} R = \alpha$, and so

$$d_{GH}(\{x_i\}, \{y_i\}) \le \tfrac{1}{2}\alpha.$$



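By the stability theorem of Chazal et al. (3.1 in their paper; Theorem 2.9 above), this implies
that the probability that

$$d_{\mathcal{B}}(\mathrm{PH}_k(\{x_i\}), \mathrm{PH}_k(\{y_i\})) > \epsilon$$

is bounded by $n\epsilon$. This further implies that the probability that

$$d_{\mathcal{B}}(\mathrm{PH}_k(\{x_i\}), \mathrm{PH}_k(\{y_i\})) > n\epsilon$$

is also bounded by $n\epsilon$. Therefore, we can conclude that

$$d_{Pr}\bigl(\Phi^n_k(X, \partial_X, \mu_X), \Phi^n_k(Y, \partial_Y, \mu_Y)\bigr) \le n\epsilon. \qquad \square$$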



One of the consequences of Theorem 5.2 is that the $\Phi^n_k$ provide robust descriptors for
metric measure spaces $(X, \partial_X, \mu_X)$. Specifically, observe that if we have
$(X, \partial_X) \subset (X', \partial_{X'})$ and a probability measure $\mu_{X'}$ that restricts to $\mu_X$ on $X$, then

$$d_{Pr}(\iota_*\mu_X, \mu_{X'}) \le 1 - \mu_{X'}(X).$$

Thus, when $X' \setminus X$ has probability $< \epsilon$,

$$d_{Pr}\bigl(\Phi^n_k(X, \partial_X, \mu_X), \Phi^n_k(X', \partial_{X'}, \mu_{X'})\bigr) \le n\epsilon.$$

In particular, when $X$ and $X'$ are finite metric spaces with the uniform measure,
we get

$$d_{Pr}\bigl(\Phi^n_k(X, \partial_X, \mu_X), \Phi^n_k(X', \partial_{X'}, \mu_{X'})\bigr) \le n\,(1 - |X|/|X'|).$$

As an immediate consequence we obtain the following result.

Theorem 5.3. For fixed $n, k$, $\Phi^n_k$ is uniformly robust with robustness coefficient $r$
and estimate bound $nr/(1 + r)$ for any $r$.
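
The estimate bound in Theorem 5.3 follows from the inequality for finite spaces by one line of arithmetic. The sketch below assumes that robustness coefficient $r$ (Definition 1.4, not restated here) refers to comparing $X$ with any $X' \supset X$ whose size satisfies $|X'| \le (1 + r)|X|$.

```latex
\frac{|X'|}{|X|} \le 1 + r
\;\Longrightarrow\;
1 - \frac{|X|}{|X'|} \le 1 - \frac{1}{1 + r} = \frac{r}{1 + r}
\;\Longrightarrow\;
d_{Pr}\bigl(\Phi^n_k(X), \Phi^n_k(X')\bigr)
  \le n \Bigl(1 - \frac{|X|}{|X'|}\Bigr)
  \le \frac{n r}{1 + r}.
```

As a numerical illustration, adding 9 points to a 1000-point cloud and taking $n = 75$ (the sizes used in Synthetic Example 1 of Section 7) gives $n(1 - |X|/|X'|) = 75 \cdot 9/1009 \approx 0.67$. Since the Prohorov distance between probability measures never exceeds 1, the bound is informative only when $n$ times the proportion of added points is small.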

To study more general metric measure spaces, the Glivenko-Cantelli theorem 
implies that consideration of large finite samples will suffice. 

Lemma 5.4. Let $S_1 \subset S_2 \subset \cdots \subset S_i \subset \cdots$ be a sequence of randomly drawn
samples from $(X, \partial_X, \mu_X)$. We regard $S_i$ as a metric measure space using the
subspace metric and the empirical measure. Then $\Phi^n_k(S_i)$ converges almost surely
to $\Phi^n_k((X, \partial_X, \mu_X))$.

Proof. This is a consequence of the fact that $\{S_i\}$ converges almost surely to
$(X, \partial_X, \mu_X)$ in the Gromov-Prohorov metric (which can be checked directly or
for instance follows from the analogous result for the Gromov-Wasserstein distance
[HI 3.5. (iii)] and the comparison between the Gromov-Prohorov distance
and the Gromov-Wasserstein distance [TTJ 10.5]). □

Remark 5.5. It would be useful to prove analogues of the main theorem for other
methods of assigning complexes; e.g., the witness complex (see Remark 2.4 and
Section 8).

6. Hypothesis testing, confidence intervals, and numerical invariants 

In this section, we discuss hypothesis testing and the closely related issue of 
confidence intervals for our barcode distribution invariants. The basic goal is to 
provide quantitative ways of saying what an observed empirical barcode distribution 
means. The model for hypothesis testing is that we have two empirical distributions 
(obtained from some kind of sampling) and we want to estimate the probability that 
they were drawn from the same distribution. We demonstrate how to do this for
$\Phi^n_k$ in some synthetic examples in the next section. In Section 8 we apply this to
validate part of the analysis of the natural images dataset in [3].

In our setting we cannot assume anything about the class of possible hypotheses 
and so we are forced to rely on non-parametric methods. Most results on non- 
parametric tests for distribution comparison work only for distributions on R and 
the first step is to project the data into this framework. In our examples, we use 
two kinds of projections. 

Definition 6.1. Let $(X, \partial_X, \mu_X)$ be a compact metric measure space. Fix $k, n \in \mathbb{N}$.
Define the distance distribution $\mathcal{D}^2$ on $\mathbb{R}$ to be the distribution on $\mathbb{R}$ induced by
applying $d_{\mathcal{B}}(-, -)$ to pairs $(b_1, b_2)$ drawn from $\Phi^n_k(X, \partial_X, \mu_X)^{\otimes 2}$. Let $B$ be a fixed
barcode in $\mathcal{B}$, and define $\mathcal{D}_B$ to be the distribution induced by applying $d_{\mathcal{B}}(B, -)$.

We will demonstrate hypothesis testing for comparison of distributions using 
these projections and the two-sample Kolmogorov-Smirnov statistic IS.j §6]. This 
test gives a way to determine whether two observed empirical distributions were 
obtained from the same underlying distribution; moreover, for distributions on $\mathbb{R}$
the p-values of the test statistic are independent of the underlying distribution as
long as the samples are independent and identically distributed.

To compute the Kolmogorov-Smirnov test statistic for two sets of samples $S_1$
and $S_2$, we first compute the empirical approximations $E_1$ and $E_2$ to the cumulative
distribution functions,

$$E_i(t) = |\{x \in S_i \mid x \le t\}| \,/\, |S_i|,$$

and use the test statistic $\sup_t |E_1(t) - E_2(t)|$. (In practice, $|S_i|$ is large and we
approximate $S_i$ using Monte Carlo methods.) Standard tables (e.g., in the appendix
to [B]) or the built-in Matlab functions can then be used to compute p-values for
deciding if the statistic allows us to reject the hypothesis that the distributions are
the same.
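
The statistic itself is simple to compute directly from the two samples. The following is a minimal sketch of the two-sample statistic $\sup_t |E_1(t) - E_2(t)|$; the function name `ks_statistic` is hypothetical, and p-values are deliberately left out since they come from tables or a statistics package.

```python
def ks_statistic(sample1, sample2):
    """Two-sample Kolmogorov-Smirnov statistic sup_t |E_1(t) - E_2(t)|.

    E_i is the empirical cumulative distribution function of sample i.
    The supremum is attained at one of the sample values, so it suffices
    to sweep through the merged, sorted samples.
    """
    a, b = sorted(sample1), sorted(sample2)
    i = j = 0
    d = 0.0
    while i < len(a) and j < len(b):
        t = min(a[i], b[j])
        # Advance both pointers past every value <= t, so ties are handled.
        while i < len(a) and a[i] <= t:
            i += 1
        while j < len(b) and b[j] <= t:
            j += 1
        d = max(d, abs(i / len(a) - j / len(b)))
    return d


print(ks_statistic([1, 2, 3, 4], [3, 4, 5, 6]))
```

Once one sample is exhausted its empirical distribution function equals 1 and the gap can only shrink, so the loop may stop there. For p-values, SciPy's `scipy.stats.ks_2samp` is one readily available implementation.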

Remark 6.2. Our choice of the Kolmogorov-Smirnov test was arbitrary and for
purposes of illustration; one might also consider the Mann-Whitney test or various
other nonparametric techniques for testing whether samples were drawn from the
same underlying distribution.

As a second demonstration of hypothesis testing, we can apply a $\chi^2$ test to discrete
distributions constructed by the following procedure. Fix a finite set $\{B_j\} \subset \mathcal{B}$
of reference barcodes, where $1 \le j \le m$. For each barcode with nonzero probability
measure in (the given empirical approximation to) $\Phi^n_k$, assign the count to the
nearest reference barcode. (In general, we compute the empirical approximation to
$\Phi^n_k$ using Monte Carlo methods.) Let $A_i(j)$ denote the count for reference barcode
$B_j$ in sample $i$ (for $i = 1, 2$). The test statistic in the $\chi^2$ test is defined to be

$$\chi^2 = \sum_j \frac{(A_1(j) - A_2(j))^2}{A_1(j) + A_2(j)}.$$

As the notation suggests, asymptotically this has a $\chi^2$ distribution with $m' - 1$
degrees of freedom (where $m'$ is the number of reference barcodes with nonzero
counts). As such, we can again look up the p-values for this distribution in standard
tables.
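
A short sketch of this computation follows. It assumes the standard two-sample form of the statistic for samples of equal total size, $\sum_j (A_1(j) - A_2(j))^2 / (A_1(j) + A_2(j))$, and skips reference barcodes that receive no counts from either sample. The function name and the counts in the usage line are hypothetical.

```python
def chi_square_two_sample(counts1, counts2):
    """Chi-square statistic and degrees of freedom for two binned samples.

    Assumptions: both samples have the same total count, and the statistic
    is sum_j (A1(j) - A2(j))^2 / (A1(j) + A2(j)). Bins that are empty in
    both samples are skipped and do not contribute a degree of freedom.
    """
    if sum(counts1) != sum(counts2):
        raise ValueError("this form of the statistic assumes equal totals")
    stat, used = 0.0, 0
    for a1, a2 in zip(counts1, counts2):
        if a1 + a2 == 0:
            continue
        stat += (a1 - a2) ** 2 / (a1 + a2)
        used += 1
    return stat, used - 1


# Hypothetical counts for three reference barcodes plus one unused one.
stat, dof = chi_square_two_sample([40, 55, 5, 0], [35, 50, 15, 0])
print(stat, dof)
```

The returned statistic is then compared against a $\chi^2$ table with the returned number of degrees of freedom.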

Remark 6.3. Independence of distribution for the p-values above holds asymptotically
as a consequence of the Glivenko-Cantelli theorem; however, in general we may
worry whether our sample sizes are large enough for the p-value to be correct.

Another approach is to use summary test statistics. A very natural test statistic
associated to $\Phi^n_k$ measures the distance to a fixed hypothesis distribution.

Definition 6.4. Let $(X, \partial_X, \mu_X)$ be a compact metric measure space and let $\mathcal{P}$ be
a fixed reference distribution on $\mathcal{B}$. Fix $k, n \in \mathbb{N}$. Define the homological distance
on $X$ relative to $\mathcal{P}$ to be

$$\mathrm{HD}^n_k((X, \partial_X, \mu_X), \mathcal{P}) = d_{Pr}(\Phi^n_k(X, \partial_X, \mu_X), \mathcal{P}).$$

The proof of Lemma 5.4 applies to show that large finite samples suffice to
approximate $\mathrm{HD}^n_k$.

Lemma 6.5. Let $S_1 \subset S_2 \subset \cdots \subset S_i \subset \cdots$ be a sequence of randomly drawn
samples from $(X, \partial_X, \mu_X)$. We regard $S_i$ as a metric measure space using the
subspace metric and the empirical measure. Then for $\mathcal{P}$ a fixed reference distribution
on $\mathcal{B}$, $\mathrm{HD}^n_k(S_i, \mathcal{P})$ converges almost surely to $\mathrm{HD}^n_k((X, \partial_X, \mu_X), \mathcal{P})$.

An immediate consequence of Theorem 5.2 is the following robustness result
(paralleling Theorem 5.3).

Theorem 6.6. For fixed $n, k, \mathcal{P}$, $\mathrm{HD}^n_k(-, \mathcal{P})$ is uniformly robust with robustness
coefficient $r$ and estimate bound $nr/(1 + r)$ for any $r$.

Alternatively, a virtue of distributions on $\mathbb{R}$ is that they can be naturally summarized
by moments; in contrast, moments for distributions on barcode space are very
hard to understand (for example, geodesics between close points are not unique).
The first moment is the mean, which could be used as a test statistic. Because we
have emphasized robust statistics, we work instead with the median (or a trimmed
mean, see Remark 6.9) and introduce the following test statistic:

Definition 6.7. Let $(X, \partial_X, \mu_X)$ be a compact metric measure space and let $\mathcal{P}$ be
a fixed reference barcode $B \in \mathcal{B}$. Fix $k, n \in \mathbb{N}$. Let $\mathcal{D}$ denote the distribution on $\mathbb{R}$
induced by applying $d_{\mathcal{B}}(B, -)$ to the barcode distribution $\Phi^n_k(X, \partial_X, \mu_X)$. Define
the median homological distance relative to $\mathcal{P}$ to be

$$\mathrm{MHD}^n_k((X, \partial_X, \mu_X), \mathcal{P}) = \mathrm{median}(\mathcal{D}).$$

Again, the Glivenko-Cantelli theorem implies that consideration of large finite 
samples will suffice. 






BLUMBERG, GAL, MANDELL, AND PANCIA 



Lemma 6.8. Let $S_1 \subset S_2 \subset \cdots \subset S_i \subset \cdots$ be a sequence of randomly drawn samples
from $(X, \partial_X, \mu_X)$. We regard $S_i$ as a metric measure space using the subspace
metric and the empirical measure. Let $\mathcal{P}$ be a fixed reference barcode, and assume
that $\mathcal{D}((X, \partial_X, \mu_X), \mathcal{P})$ has a distribution function with a positive derivative at the
median. Then $\mathrm{MHD}^n_k(S_i, \mathcal{P})$ almost surely converges to $\mathrm{MHD}^n_k((X, \partial_X, \mu_X), \mathcal{P})$.

Proof. As in the proof of Lemma 5.4, the fact that $\{S_i\}$ converges almost surely to
$(X, \partial_X, \mu_X)$ in the Gromov-Prohorov metric implies that $\mathcal{D}(S_i)$ weakly converges
to $\mathcal{D}$. Now the central limit theorem for the sample median (see for instance [501
III. 4. 24]) implies the convergence of medians. □



Remark 6.9. To remove the (possibly hard to verify) hypothesis in Lemma 6.8, one
can replace the median with a trimmed mean (i.e., the mean of the distribution
obtained by throwing away the top and bottom $k\%$, for some constant $k$).

As discussed in the introduction, a counting argument yields the following robustness
result.

Theorem 6.10. For any $n, k, \mathcal{P}$, the function $\mathrm{MHD}^n_k(-, \mathcal{P})$ from finite metric
spaces (given the uniform probability measure) to $\mathbb{R}$ is robust with robustness
coefficient $> (\ln 2)/n$.

A particular advantage of $\mathrm{HD}^n_k$ and $\mathrm{MHD}^n_k$ is that we can define confidence
intervals using the standard non-parametric techniques for determining confidence
intervals for the median [§1 §7.1]. Specifically, we use appropriate sample quantiles
(order statistics) to determine the bounds for an interval which contains the actual
median with confidence $1 - \alpha$. For example, a simple approximation can be obtained
from the fact that order statistics asymptotically obey binomial distributions, which
leads to the following definition using the normal approximation to the binomial
distribution.

Definition 6.11. Let $(X, \partial_X, \mu_X)$ be a metric measure space, $B$ a fixed barcode,
and $\mathcal{P}$ a fixed distribution on barcodes. Fix $0 < \alpha < 1$ and $n, k$. Given $m$ empirical
approximations to $\mathrm{HD}^n_k(-, \mathcal{P})$ or $\mathrm{MHD}^n_k(-, B)$, let $\{s_i\}$ denote the samples sorted
from smallest to largest. Let $u_\alpha$ denote the $\alpha/2$ significance threshold for a standard
normal distribution. The $1 - \alpha$ confidence interval for the sample median is given
by the interval

$$\left[\, s_{\lfloor \frac{m+1}{2} - \frac{1}{2}\sqrt{m}\, u_\alpha \rfloor},\;\; s_{\lceil \frac{m+1}{2} + \frac{1}{2}\sqrt{m}\, u_\alpha \rceil} \,\right].$$

Note that a further advantage of $\mathrm{HD}^n_k$ and $\mathrm{MHD}^n_k$ is that the asymptotics we
rely on for the confidence intervals in the preceding definition do not depend on the
convergence result in the main theorem, and as a consequence we can apply these
confidence intervals when studying filtered complexes generated by a procedure
other than the Rips complex, e.g., the witness complex. We give examples in
Section 8 of both sorts of confidence interval.
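
An order-statistic confidence interval of this kind can be sketched as follows. The rank formula used here, $(m+1)/2 \mp \tfrac{1}{2}\sqrt{m}\,u$ with $u$ the upper $\alpha/2$ quantile of the standard normal distribution, is the usual normal approximation to the binomial and is an assumption of this sketch rather than a quotation of the definition. The function name is hypothetical.

```python
import math
from statistics import NormalDist


def median_confidence_interval(samples, alpha):
    """Approximate (1 - alpha) confidence interval for the median.

    Assumption: the bounds are the order statistics with 1-indexed ranks
    floor((m + 1)/2 - sqrt(m)/2 * u) and ceil((m + 1)/2 + sqrt(m)/2 * u),
    where u is the upper alpha/2 quantile of the standard normal.
    Ranks are clamped to the valid range 1..m.
    """
    s = sorted(samples)
    m = len(s)
    u = NormalDist().inv_cdf(1.0 - alpha / 2.0)
    half_width = 0.5 * math.sqrt(m) * u
    lo = max(1, math.floor((m + 1) / 2 - half_width))
    hi = min(m, math.ceil((m + 1) / 2 + half_width))
    return s[lo - 1], s[hi - 1]


print(median_confidence_interval(range(1, 101), alpha=0.05))
```

The interval depends only on ranks, so it is unaffected by how extreme the smallest and largest samples are, in keeping with the emphasis on robust statistics.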

More sophisticated estimates involving better techniques for non-parametric confidence
intervals for the population median can also be adopted. Of course, such
non-parametric estimates depend on the convergence to the binomial approximation,
which we cannot know a priori (cf. Remark 6.3 above). Furthermore, this
requires empirical estimates of the distribution $\Phi^n_k(X, \partial_X, \mu_X)$. But provided this
is available, we can perform confidence-interval based hypothesis testing.

Remark 6.12. All of the inference procedures described in this section require repeated
sampling of samples of size $k$. This raises the question of whether, given the
opportunity to obtain $m$ samples of size $k$, it is more sensible to regard this as
$m$ samples of size $k$ or one sample of size $mk$ (or something in between).
Considerations of the latter sort quickly lead to consideration of bootstrap and resampling
methods; we believe that bootstrap confidence intervals are most likely
to be useful in this setting. We study resampling, proving asymptotic consistency
and studying the convergence behaviors, in the companion paper [T].

Finally, in principle we would like to be able to use the distribution invariant
$\Phi^n_k(X, \partial_X, \mu_X)$ (or perhaps other distributions of barcodes) in likelihood statistics.
For example, given a hypothesis barcode $B$, we can empirically estimate the probability
that one would see a barcode within $\epsilon$ of $B$ when sampling $n$ points. This
then in principle allows us to test whether a particular sample (or collection of
samples) is consistent with the hypothesis $\mathrm{Hyp}^n_k(X; B, \epsilon)$ that $B$ is within $\epsilon$ of the
barcode of a sample of size $n$ drawn from $(X, \partial_X, \mu_X)$. More generally, we can
distinguish between $(X, \partial_X, \mu_X)$ and $(X', \partial_{X'}, \mu_{X'})$ by calculating likelihoods of
the hypotheses $\mathrm{Hyp}^n_k(X; B, \epsilon)$ and $\mathrm{Hyp}^n_k(X'; B, \epsilon)$. Specifically, given an observed
barcode $B$ obtained by sampling $n$ points from an unknown metric measure space
$(Z, \partial_Z, \mu_Z)$, we can compute the likelihood

$$L_X = L(X, \partial_X, \mu_X) = \Pr\bigl(d_{\mathcal{B}}(B, \tilde{B}) < \epsilon \mid \tilde{B} \text{ drawn from } \Phi^n_k(X, \partial_X, \mu_X)\bigr).$$

The ratio $L_X / L_{X'}$ provides a test statistic for comparing the two hypotheses.
A significant difficulty with this approach, however, is that in order to compute
the thresholds on $L_X / L_{X'}$ for deciding which hypothesis to accept at a given
p-value, we require knowledge of the distribution of the likelihood scores induced by
$\Phi^n_k(X, \partial_X, \mu_X)$ and $\Phi^n_k(X', \partial_{X'}, \mu_{X'})$, which in general must be obtained by Monte
Carlo simulation (i.e., repeated sampling).

We can overcome the difficulty above by instead testing the likelihood of a more
distributional statement. For a metric measure space $(X, \partial_X, \mu_X)$ and a subset $S$ of
$\mathcal{B}$, we can estimate the likelihood that the distribution $\Phi^n_k(X)$ has mass $\ge \epsilon$ on $S$ as
follows. For any hypothetical distribution on $\mathcal{B}$ with mass $\ge \epsilon$ on $S$, the probability
of an empirical sample of size $N$ having $q$ or fewer elements in $S$ is bounded above
by the binomial cumulative distribution function

$$\mathrm{BD}(N, q, \epsilon) = \sum_{i=0}^{q} \binom{N}{i} \epsilon^i (1 - \epsilon)^{N-i}.$$

Then given an empirical approximation $E$ to $\Phi^n_k$ obtained from $N$ samples, we can
test the hypothesis that $\Phi^n_k$ has mass $\ge \epsilon$ in $S$, by taking $q$ to be the number of
such elements in $E$. When $\mathrm{BD}(N, q, \epsilon) < \alpha$, we can reject this hypothesis at the
$1 - \alpha$ level.
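
The binomial test described above needs nothing beyond the cumulative distribution function of the binomial distribution. The following is a minimal sketch; the function names are hypothetical, and $\mathrm{BD}(N, q, \epsilon)$ is taken to be the probability of at most $q$ successes in $N$ independent trials with success probability $\epsilon$.

```python
from math import comb


def binomial_cdf(N, q, eps):
    """BD(N, q, eps): probability of at most q successes in N trials."""
    return sum(comb(N, i) * eps**i * (1.0 - eps) ** (N - i)
               for i in range(q + 1))


def reject_mass_hypothesis(N, q, eps, alpha):
    """True if the hypothesis 'mass >= eps on S' is rejected at level 1 - alpha.

    N is the number of empirical samples and q the number that fell in S.
    """
    return binomial_cdf(N, q, eps) < alpha


# Values matching the synthetic example of Section 7: 31 of 1000 samples in S.
print(binomial_cdf(1000, 31, 0.05))
print(reject_mass_hypothesis(1000, 31, 0.05, alpha=0.01))
```

Because the bound holds for every distribution with mass at least $\epsilon$ on $S$, no Monte Carlo simulation over null hypotheses is needed to calibrate the test.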

Remark 6.13. In the test statistics in this section, we have suggested using Monte
Carlo methods to estimate distributions but not to estimate p-values. The
reason is that the space of possible null hypotheses is often so large that it is
infeasible to imagine doing the required Monte Carlo simulations.

7. Demonstration of hypothesis testing on synthetic examples

Synthetic Example 1: The annulus and the annulus plus diameter linkage.

To demonstrate hypothesis testing methodology for $\Phi^n_k$ we first consider a simple
example which illustrates the robustness of the distributional invariants. We generated
a set $S_1$ of 1000 points by sampling uniformly (via rejection sampling) from
an annulus of inner radius 0.8 and outer radius 1.2 in $\mathbb{R}^2$ (see Figure 1). The
underlying manifold is homotopy equivalent to a circle, and computing the barcode
for the first homology group (with cutoff of 0.75) yields a single long interval,
displayed in Figure 2.




Figure 1. Annulus with inner radius 0.8 and outer radius 1.2 
(left), and same annulus together with diameter linkage given by 
the nine points (0,0.8), (0,0.6), (0,0.4), (0,0.2), (0,0), (0,-0.2), 
(0,-0.4), (0,-0.6), (0,-0.8) (right). 




Figure 2. Barcode for annulus via the Rips complex with 1000
points. Horizontal scale goes from 0 to 0.75. (Vertical scale is not
meaningful.)

We then added the set of points

$$X = \{(0, 0.8), (0, 0.6), (0, 0.4), (0, 0.2), (0, 0), (0, -0.2), (0, -0.4), (0, -0.6), (0, -0.8)\}$$

to form the set $S_2 = S_1 \cup X$. By adding these 9 points, the point cloud now
appears to have been sampled from an underlying manifold homotopy equivalent
to a figure 8 when the parameter is between 0.2 and our cutoff 0.75. (See Figure 1.)
Computing the barcode for the first homology group now yields two long intervals,
displayed in Figure 3.

Figure 3. Barcode for the annulus plus diameter linkage via the
Rips complex with 1000 points. Horizontal scale goes from 0 to
0.75. (Vertical scale is not meaningful.)

We computed empirical approximations to $\Phi^{75}_1(S_1)$ and $\Phi^{75}_1(S_2)$, using 1000 samples
of size 75 and barcode cutoffs of 0.75. Comparing the empirical distance distributions
$\mathcal{D}^2$ (as in Definition 6.1) using the Kolmogorov-Smirnov statistic suggested
they were drawn from the same distribution at the 95% confidence level. Fixing
a reference barcode $B_1$ with a single long bar and using the associated distance
distribution produced the same result.

For the $\chi^2$ test on these samples, we found that the resulting distributions had
nontrivial mass clustered in three regions: around a barcode $B_0$ with no long intervals,
a barcode $B_1$ with one long interval, and a barcode $B_2$ with two long intervals.
Assigning each point in the empirical approximation to the nearest such barcode,
the distribution of masses was 0.236, 0.749, and 0.015 for $S_1$ and 0.234, 0.741, and
0.025 for $S_2$. The $\chi^2$ test suggested they were drawn from the same distribution at
the 95% confidence level.
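
The reported conclusion can be checked by hand. The computation below is a sketch under two assumptions: the masses are converted to counts out of the 1000 samples used for each empirical approximation, and the statistic has the equal-sample-size form $\sum_j (A_1(j) - A_2(j))^2 / (A_1(j) + A_2(j))$.

```latex
% Counts: S_1 -> (236, 749, 15), S_2 -> (234, 741, 25).
\chi^2 = \frac{(236 - 234)^2}{236 + 234}
       + \frac{(749 - 741)^2}{749 + 741}
       + \frac{(15 - 25)^2}{15 + 25}
       = \frac{4}{470} + \frac{64}{1490} + \frac{100}{40}
       \approx 0.009 + 0.043 + 2.500
       \approx 2.55.
```

With three occupied reference barcodes there are $3 - 1 = 2$ degrees of freedom, for which the 95% critical value is $-2\ln(0.05) \approx 5.99$. Since $2.55 < 5.99$, the hypothesis that the two empirical distributions come from the same underlying distribution is not rejected, in agreement with the text. Almost all of the statistic comes from the barcode with two long intervals, which is where the nine added points have their effect.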

This example also begins to illuminate a relationship between the distributional 
invariants and density filtering. Notice that the second interval at the bottom of 
Figure [3] starts somewhat later, reflecting a difference in average interpoint distance 
between the original samples and the additional points added. As a consequence, 
one might imagine that appropriate density filtering would also detect these points. 
On the one hand, in many cases density filtering is an excellent technique for con- 
centrating on regions of interest. On the other hand, it is easy to construct examples 
where density filtering fails (for instance, we can build examples akin to the one 
studied here where the "connecting strip" has comparable density to the rest of 
the annulus simply by reducing the number of sampled points or by expanding the 
outer radius while keeping the number of sampled points fixed). More generally, 
studying distributional invariants (such as $\Phi^n_k$) by definition allows us to integrate
information from different density scales. In practice, we expect there to be a synergistic
interaction between density filtering and the use of $\Phi^n_k$; see Section 8 for an
example.



Synthetic Example 2: Friendly circles. Next, we considered a somewhat more
complicated example. We sampled 750 points uniformly in the volume measure from
the circle centered at (0, 0) of radius 2, and 750 points uniformly in the volume
measure from the circle centered at (0.8, 0) of radius 1. We then added uniform
noise sampled from $[-2, 2] \times [-2, 2] \subset \mathbb{R}^2$. (See Figure 4.) Without noise, we
saw the expected pair of long bars in the barcode for the first persistent homology
group, computed using the Rips complex. However, as noise was added, the results
of computing barcodes using the Rips complex degraded very rapidly, as we see in
Figure 5.




Figure 4. Two circles with noise (indicated by gray box). 

Even with only 10 noise points, we see 3 bars, and with 90 noise points there 
are 12. (These results were stable across different samples; we report results for a 
representative run.) 

We computed $\Phi^{300}_1$ for the same point clouds (i.e., the two circles plus varying
numbers of noise points), using 1000 samples of size 300 and a cutoff of 0.75. The
resulting empirical distributions had essentially all of their weight concentrated
around barcodes with a small number of long intervals. We clustered the points
in the empirical estimate of $\Phi^{300}_1$ around barcodes with fixed numbers of long bars
(from 0 to the cutoff), assigning each point to the nearest barcode. The results are
summarized in Table 1 below.



Table 1. Distribution summaries for $\Phi^{300}_1$: number of samples (out of 1000)
nearest to a barcode with the given number of long bars. Blank cells are empty;
"?" marks entries that are not recoverable from this copy.

Number of noise pts | 0 bars | 1 bar | 2 bars | 3 bars | 4 bars | 5 bars
--------------------+--------+-------+--------+--------+--------+-------
                  0 |    ?   |   ?   |   696  |     1  |    ?   |    ?
                 10 |        |  305  |   589  |   106  |        |
                 20 |    ?   |   ?   |   590  |   132  |    ?   |    ?
                 30 |        |  285  |   594  |   119  |     2  |
                 40 |    ?   |  259  |   584  |   149  |    ?   |    ?
                 50 |        |  289  |   553  |   154  |     4  |
                 60 |        |  254  |   591  |   146  |     7  |     2
                 70 |        |  277  |   564  |   154  |     5  |
                 80 |     1  |  229  |   543  |   196  |    29  |     2
                 90 |        |  229  |   533  |   207  |    28  |    ?
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Figure 5. Barcode for two circles with 10, 50, and 90 noise points. 
Horizontal scale goes from to 0.75. (Vertical scale is not mean- 
ingful.) 



A glance at the table shows that the majority of the weight is clustered around a
barcode with 2 long bars and that the data overwhelmingly support a hypothesis of
$\le 3$ long bars under all noise regimes. We can be more precise using the
likelihood statistic of the last section to evaluate the hypothesis H that the
observed empirical approximation to $\Phi_1^{300}$ was drawn from an underlying
barcode distribution with weight > 5% on barcodes with more than 3 long bars. In
the strictest tests, with 80 and 90 noise points, 31 out of 1000 samples were near
barcodes with more than 3 long bars, and so we estimate the probability that the
distribution has > 5% of its mass on barcodes with 4 or more long bars to be
$\le \mathrm{BD}(1000, 31, 0.05) \le 0.22\%$. Put another way, we can reject the
hypothesis that the actual distribution has more than 5% of its mass on barcodes
with 4 or more long bars at the 99.7% level.
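
The bound above can be checked numerically. The following is a minimal sketch,
assuming that BD(n, k, p) denotes the cumulative binomial probability of seeing
at most k successes in n independent trials with success probability p. The
function name binomial_cdf is ours and purely illustrative.

```python
from math import comb

def binomial_cdf(n, k, p):
    """P(X <= k) for X ~ Binomial(n, p); our assumed reading of BD(n, k, p)."""
    return sum(comb(n, j) * p**j * (1.0 - p)**(n - j) for j in range(k + 1))

# Row for 80 noise points in Table 1: 29 + 2 samples had more than 3 long bars.
observed = 29 + 2
p_value = binomial_cdf(1000, observed, 0.05)
print(f"BD(1000, {observed}, 0.05) = {p_value:.4%}")
```

A printed value below 0.3% corresponds to rejecting H at the 99.7% level, in
line with the statement in the text.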

8. Application: confidence intervals for the natural images dataset 

8.1. Setup. In this section, we compute confidence intervals based on
$\mathrm{MHD}_k^n$ for a subset of patches from the natural images data set as
described in [3]. We briefly review the setup. The dataset consists of 15000
points in $\mathbb{R}^8$, generated as follows. From the natural images,
$3 \times 3$ patches (dimensions given in pixels) were sampled and the top 20%
with the highest contrast were retained. These patches were then normalized
twice, first by subtracting the mean intensity and then by scaling so that the
Euclidean norm is 1. The resulting dataset can be regarded as living on the
surface of an $S^7$ embedded in $\mathbb{R}^8$. After performing density
filtering (with a parameter value of k = 15; refer to [3] for details) and
randomly selecting 15000 points, we are left with the dataset
$\mathcal{M}(15, 30)$. At this density, one tends to see a barcode corresponding
to 5 cycles in $H_1$. In the Klein bottle model, these cycles are generated by
three circles, intersecting pairwise at two points (which can be visualized as
unit circles lying on the xy-plane, the yz-plane, and the xz-plane).
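
The two normalization steps can be sketched as follows. This is only an
illustration of the description above, not the pipeline of [3]: the function
name and the sample patch are hypothetical, and the choice of coordinates on
the resulting 8-dimensional subspace is not reproduced. A vector of 9 pixel
values with mean zero lies in an 8-dimensional subspace of $\mathbb{R}^9$, and
fixing its norm to 1 places it on a 7-sphere inside that subspace.

```python
from math import sqrt

def normalize_patch(patch):
    """Subtract the mean intensity, then rescale to unit Euclidean norm."""
    mean = sum(patch) / len(patch)
    centered = [v - mean for v in patch]
    norm = sqrt(sum(v * v for v in centered))
    if norm == 0.0:
        raise ValueError("constant patch has no contrast to normalize")
    return [v / norm for v in centered]

# A hypothetical 3 x 3 patch, flattened row by row (values are made up).
patch = [12.0, 15.0, 11.0, 40.0, 42.0, 39.0, 90.0, 95.0, 88.0]
unit = normalize_patch(patch)
```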

8.2. Results. We computed an empirical approximation to
$\Phi_1^{500}(\mathcal{M}(15, 30))$ and found that (after clustering, as above)
the weight was distributed as 0.1% with one bar, 1.1% with two bars, 7.4% with
three bars, 34.2% with four bars, and 57.2%
with five bars. Analyzing likelihoods, we can see that the underlying distribution
has at least 95% of its mass on three, four, or five bars at the 99.7% confidence
level.

We also analyze the results using MHD. We use as the hypothesis barcode the
multi-set {(a, c), (a, c), (a, c), (a, c), (a, c)}, where a is chosen to be
smaller than the minimum value at which any signal appears in the first Betti
number and c is the maximum bar length found in the dataset (typically an
arbitrary value obtained as the cutoff for the maximal filtration value; we used
the value reported in [3], which is 2). We approximate
$\mathrm{MHD}_1^{500}(\mathcal{M}(15, 30))$ by the empirical distribution based
on 1000 samples. Using the non-parametric estimate based on the interval
statistics, we find that the 95% confidence interval for
$\mathrm{MHD}_1^{500}(\mathcal{M}(15, 30))$ is [0.442, 0.476]. The 99% confidence
interval for $\mathrm{MHD}_1^{500}(\mathcal{M}(15, 30))$ is [0.436, 0.481]. These
results represent high confidence that the data are further than 0.442 but
closer than 0.476 to the reference barcode. On the other hand, when we compute
the confidence intervals using the empty set as the reference barcode, we find
that both the 95% and 99% confidence intervals are the cutoff value of 2. We find
similar results for hypothesis barcodes with fewer than five bars.
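
For illustration, a distribution-free confidence interval of this kind can be
sketched from order statistics. The sketch assumes that MHD is a median of the
distribution of distances to the reference barcode and that the sampled
distances are independent; the exact variant behind the intervals reported
above may differ, and the function name and stand-in data are ours. For
B ~ Binomial(m, 1/2), the interval between the l-th smallest and l-th largest
observation covers the median with probability at least 1 - 2 P(B <= l - 1).

```python
from math import comb

def median_confidence_interval(samples, level=0.95):
    """Distribution-free confidence interval for the median from order statistics."""
    xs = sorted(samples)
    m = len(xs)
    alpha = 1.0 - level
    tail = 0.0  # running value of P(B <= j) for B ~ Binomial(m, 1/2)
    rank = 0    # largest l with 2 * P(B <= l - 1) <= alpha
    for j in range(m // 2):
        tail += comb(m, j) / 2**m
        if 2.0 * tail > alpha:
            break
        rank = j + 1
    if rank == 0:
        raise ValueError("too few samples for the requested confidence level")
    # Interval [x_(l), x_(m - l + 1)] in 1-based order-statistic notation.
    return xs[rank - 1], xs[m - rank]

# Stand-in data: in the setting above these would be the 1000 distances
# from sampled barcodes to the reference barcode.
distances = [i / 1000.0 for i in range(1, 1001)]
lo95, hi95 = median_confidence_interval(distances, 0.95)
lo99, hi99 = median_confidence_interval(distances, 0.99)
```

By construction the 99% interval contains the 95% interval, which matches the
pattern of the intervals reported above.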

We interpret these results to suggest that the hypothesis barcode is the best 
summarization amongst barcode distributions that put all of their mass on a single 
barcode. Of course, these results also suggest that when sampling at 500 points, 
we simply do not expect to see a distribution that is heavily concentrated around a 
single barcode. In the next subsection, we discuss the use of the witness complex, 
which does result in such a distribution. 

Remark 8.1. To validate the non-parametric estimate of the confidence interval,
we also used bootstrap resampling to compute bootstrap confidence intervals.
Although we do not justify or discuss further this procedure herein (see instead pQ),
we note that we observed the reassuring phenomenon that the bootstrap
confidence intervals agreed closely with the non-parametric estimates for both
the 95% confidence intervals and the 99% confidence intervals in each instance.
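
As a rough illustration of such a cross-check, the following sketch computes a
generic percentile bootstrap interval for a median. The remark does not specify
the bootstrap variant actually used, so the method, the function name, the
number of resamples, and the stand-in data are all assumptions.

```python
import random
from statistics import median

def bootstrap_median_ci(samples, level=0.95, n_boot=1000, seed=0):
    """Percentile bootstrap confidence interval for the median."""
    rng = random.Random(seed)
    m = len(samples)
    # Resample with replacement and record the median of each resample.
    medians = sorted(
        median(rng.choices(samples, k=m)) for _ in range(n_boot)
    )
    alpha = 1.0 - level
    lo_idx = int((alpha / 2.0) * n_boot)
    hi_idx = min(n_boot - 1, int((1.0 - alpha / 2.0) * n_boot))
    return medians[lo_idx], medians[hi_idx]

# Stand-in data in place of the distances to the reference barcode.
distances = [i / 1000.0 for i in range(1, 1001)]
lo, hi = bootstrap_median_ci(distances)
```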

8.3. Results with the witness complex. Because of the size of the datasets
involved, the analysis performed in [3] used the weak witness complex rather
than the Rips complex VR. The weak witness complex for a metric space (X, d)
depends on a subset $W \subset X$ of witnesses; the size of the complex is
controlled by $|W|$ and not $|X|$.

Definition 8.2. For $\epsilon \in \mathbb{R}$, $\epsilon > 0$, and witness set
$W \subset X$, the weak witness complex $W_\epsilon(X, W)$ is the simplicial
complex with vertex set W such that $[v_0, v_1, \ldots, v_n]$ is an n-simplex
when for each pair $v_i, v_j$ there exists a point $p \in X$ (a witness) such
that the distances $d(v_i, p)$ and $d(v_j, p)$ are at most $\epsilon$.
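
A small sketch may help make the definition concrete. It reads the definition
as requiring, for each pair of vertices, a single point p of X within
$\epsilon$ of both; that reading, the function names, and the toy point cloud
are assumptions for illustration. Under this reading the complex is determined
by its edges, so higher simplices are found by checking all pairs.

```python
from itertools import combinations
from math import dist

def weak_witness_edges(points, vertex_ids, eps):
    """Pairs of vertices (indices into points) that share a witness within eps."""
    # near[i] = indices of all points of X within eps of vertex i
    near = {
        i: {k for k, p in enumerate(points) if dist(points[i], p) <= eps}
        for i in vertex_ids
    }
    return {
        (i, j)
        for i, j in combinations(sorted(vertex_ids), 2)
        if near[i] & near[j]
    }

def simplices(vertex_ids, edges, dim):
    """All dim-simplices: vertex subsets in which every pair is an edge."""
    return [
        s
        for s in combinations(sorted(vertex_ids), dim + 1)
        if all(pair in edges for pair in combinations(s, 2))
    ]

# Toy point cloud X on a line; vertices 0 and 2 share the witness at index 1.
X = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0), (10.0, 0.0)]
W = [0, 2, 3]
edges = weak_witness_edges(X, W, eps=1.0)
```

The cost is driven by the number of vertices in W rather than by the size of X,
which is the reason given above for preferring this complex on large datasets.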

When working with the witness complex, we adapt our basic approach to study 
the induced distribution on barcodes which comes from fixing the point cloud and 
repeatedly sampling a fixed number of witnesses. The theoretical guarantees we 
obtained for the Rips complex in this paper do not apply directly; we intend to 
study the robustness and asymptotic behavior of this process in future work. Here, 
we report preliminary numerical results. 

Again, we use as the hypothesis barcode the multi-set {(0, c), (0, c), (0, c),
(0, c), (0, c)} as above. We approximated $\mathrm{MHD}_1^n(\mathcal{M}(15, 30))$
(various n) by the empirical distribution obtained by sampling n landmark points
from $\mathcal{M}(15, 30)$, computing the barcode, and computing the distance to
the reference barcode. Doing this 1000 times, we find using the non-parametric
estimate based on the interval statistics that the 95% confidence interval for
$\mathrm{MHD}_1^{100}(\mathcal{M}(15, 30))$ is [0.024, 0.027]. The 99% confidence
interval for $\mathrm{MHD}_1^{100}(\mathcal{M}(15, 30))$ was also [0.024, 0.027].
When we computed the 95% confidence interval for
$\mathrm{MHD}_1^{150}(\mathcal{M}(15, 30))$ we obtained [0.021, 0.023]. The 99%
confidence interval for $\mathrm{MHD}_1^{150}(\mathcal{M}(15, 30))$ was
[0.021, 0.024]. This represents high confidence that the data are further than
0.021 (for $\Phi_1^{150}$) and 0.024 (for $\Phi_1^{100}$) but closer than 0.024
(for $\Phi_1^{150}$) and 0.027 (for $\Phi_1^{100}$) to the reference barcode. We
obtained essentially the same results for $\mathrm{MHD}_1^{500}$ as for
$\mathrm{MHD}_1^{150}$. We interpret these results to mean that the underlying
distribution is essentially concentrated around the hypothesis barcode; the
distance of 0.025 is essentially a consequence of noise.

Remark 8.3. In contrast, when we compute
$\mathrm{MHD}_1^{25}(\mathcal{M}(15, 30))$, we find the confidence interval is
[1.931, 1.939]. When we compute $\mathrm{MHD}_1^{75}(\mathcal{M}(15, 30))$, we
find that the confidence interval is [1.859, 1.866]. This represents high
confidence that $\mathrm{MHD}_1^{25}$ and $\mathrm{MHD}_1^{75}$ are far from this
reference barcode, which in light of the confidence intervals above for
$\mathrm{MHD}_1^{150}$ and $\mathrm{MHD}_1^{100}$ appears to indicate that sample
sizes 25 and 75 are too small.
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